Anomaly detection - Isolation Forest


by 30303 2024. 3. 25. 10:23


Isolation Forest

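Isolation Forest partitions the data with decision trees. To isolate a normal point, you have to travel deep down the tree,

whereas an anomaly can be separated near the top of the tree. The technique exploits this difference.

Assumption: anomalies end up at a shallower depth than normal points.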
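The depth assumption can be checked with a small simulation. The sketch below is not the full algorithm: it uses a single made-up 1-D dataset, and `isolation_depth` and `average_depth` are hypothetical helper names. It only counts how many random splits are needed before a given point is alone in its partition.

```python
import random

def isolation_depth(values, target, rng):
    """Count random splits until `target` is alone in its partition."""
    points = list(values)
    depth = 0
    while len(points) > 1:
        lo, hi = min(points), max(points)
        if lo == hi:  # remaining points are identical, no further split possible
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target < split:
            points = [v for v in points if v < split]
        else:
            points = [v for v in points if v >= split]
        depth += 1
    return depth

def average_depth(values, target, trials=200, seed=0):
    """Average the isolation depth over many random trials."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(trials))
    return total / trials

# Illustrative data: 200 points around 0 plus one far-away point.
data_rng = random.Random(42)
normal = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
outlier = 10.0
inlier = sorted(normal)[len(normal) // 2]  # a point in the middle of the cluster
values = normal + [outlier]

avg_outlier = average_depth(values, outlier)
avg_inlier = average_depth(values, inlier)
print(f"outlier: {avg_outlier:.2f} splits, inlier: {avg_inlier:.2f} splits")
```

On data like this, the far-away point should need noticeably fewer splits on average than the point in the middle of the cluster.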

Just as Random Forest ensembles decision trees, Isolation Forest also builds an ensemble of trees (iTrees).

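Sub-sampling: the dataset for each tree is drawn by sampling without replacement.

: Random Forest, by contrast, samples with replacement (bootstrap). About 37% of the data is left out of each bootstrap sample, which helps make it robust to noise in the data.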
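The 37% figure comes from (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. A minimal check, assuming a made-up dataset of 10,000 row indices and a sub-sample size of 256 (the `'auto'` cap described in the parameter section):

```python
import math
import random

rng = random.Random(0)
n = 10_000  # illustrative dataset size

# Bootstrap (with replacement): draw n row indices out of n.
bootstrap = rng.choices(range(n), k=n)
oob_fraction = 1 - len(set(bootstrap)) / n  # share of rows never drawn

# Sub-sampling (without replacement): every drawn row is distinct.
subsample = rng.sample(range(n), k=256)

print(f"left out by bootstrap: {oob_fraction:.3f} (theory: {math.exp(-1):.3f})")
print(f"distinct rows in sub-sample: {len(set(subsample))}")
```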

Random feature selection: the feature to split on is chosen at random.

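Setting the split point

To split on the selected feature at random, the split value is sampled from a uniform distribution defined between that feature's minimum and maximum values.

The maximum depth of a tree is determined by the number of samples n it is built on, as $\lceil \log_2 n \rceil$ (with 256 samples, the maximum depth is 8).

Splitting continues until the maximum depth is reached or a node holds a single point. (Without the depth limit, splitting would go on until every point is isolated, i.e. a fully overfitted tree.)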
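The splitting procedure above can be sketched as a short recursive function. This is a simplified illustration, not scikit-learn's implementation: `build_itree`, `height_limit`, `tree_depth` and `leaf_sizes` are hypothetical names, rows are plain tuples, and a node becomes a leaf as soon as the chosen feature has no spread.

```python
import math
import random

def height_limit(n):
    """Maximum tree depth for a sub-sample of size n: ceil(log2(n))."""
    return math.ceil(math.log2(max(n, 2)))

def build_itree(rows, depth, max_depth, rng):
    """Build one isolation tree from a list of equal-length tuples."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}  # leaf node
    feature = rng.randrange(len(rows[0]))  # random feature
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # simplification: stop if this feature cannot be split
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)  # random split point between min and max
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_itree(left, depth + 1, max_depth, rng),
        "right": build_itree(right, depth + 1, max_depth, rng),
    }

def tree_depth(node):
    """Number of edges on the longest root-to-leaf path."""
    if "size" in node:
        return 0
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))

def leaf_sizes(node):
    """Sizes of all leaves, left to right."""
    if "size" in node:
        return [node["size"]]
    return leaf_sizes(node["left"]) + leaf_sizes(node["right"])

rng = random.Random(0)
rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(256)]
limit = height_limit(len(rows))
tree = build_itree(rows, 0, limit, rng)
print(f"height limit: {limit}, actual depth: {tree_depth(tree)}")
```

Because of the depth limit, some leaves may still hold more than one point. That is acceptable here, since only the shallow, quickly isolated points matter for detecting anomalies.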

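Computing the score for each data point

The anomaly score is $s(x, n) = 2^{-E(h(x)) / c(n)}$, where n is the number of samples each tree is built on.

- $h(x)$: path length of the observation (distance from the root node, i.e. depth)

- $E(h(x))$: average path length of the observation over all iTrees

For example, with 100 trees, the data point is passed down each tree, $h(x)$ is computed per tree, and the values are averaged

- $c(n)$: average path length of an iTree built on n samples (used to normalize the score)

- The score lies between 0 and 1. The closer it is to 1, the more likely the point is an anomaly, and a point with a score of 0.5 or below is treated as normal

- If $E(h(x)) = c(n)$, the score is 0.5

- If $E(h(x)) = 0$, the score is 1

- If $E(h(x))$ is the maximum path length ($n - 1$), the score approaches 0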
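The three cases above can be verified numerically. The sketch below assumes the normalization from the original Isolation Forest paper, c(n) = 2H(n-1) - 2(n-1)/n with H(i) ≈ ln(i) + 0.5772156649 (Euler's constant), and the score 2^(-E(h(x)) / c(n)). The function names `c` and `anomaly_score` and the path lengths are made up for illustration.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an iTree built on n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """Score from the path lengths of one point across all trees."""
    mean_path = sum(path_lengths) / len(path_lengths)  # E(h(x))
    return 2.0 ** (-mean_path / c(n))

n = 256
print(f"c({n}) = {c(n):.2f}")
print("E(h(x)) = c(n)  ->", anomaly_score([c(n)], n))
print("E(h(x)) = 0     ->", anomaly_score([0, 0, 0], n))
print("E(h(x)) = n - 1 ->", anomaly_score([n - 1], n))

# Hypothetical path lengths from four trees: short paths score higher.
print("short paths:", round(anomaly_score([3, 4, 2, 5], n), 3))
print("long paths: ", round(anomaly_score([11, 12, 10, 13], n), 3))
```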

Example: tree height limit

- Tree depth: The depth of a node is the number of edges from the node to the tree's root node.

- Tree height: The height of a node is the number of edges on the longest path from the node to a leaf.
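To make the two definitions concrete, here is a toy example on a hypothetical four-node tree (A is the root, B and C are its children, D is a child of B). The helpers `depth` and `height` are illustrative only.

```python
# Toy tree as an adjacency mapping: node -> list of children.
tree = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}

def height(node):
    """Edges on the longest path from `node` down to a leaf."""
    children = tree[node]
    if not children:
        return 0
    return 1 + max(height(child) for child in children)

def depth(node, current="A", edges=0):
    """Edges from the root ("A") down to `node`, or None if absent."""
    if current == node:
        return edges
    for child in tree[current]:
        found = depth(node, child, edges + 1)
        if found is not None:
            return found
    return None

for name in tree:
    print(name, "depth:", depth(name), "height:", height(name))
```

The root A has depth 0 and height 2, while the leaf D has depth 2 and height 0.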

[Isolation Forest Parameters]

package : https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html

n_estimators: number of base estimators, default=100

max_samples: number of samples drawn for each estimator (int or float)

- If int, then draw max_samples samples.

- If float, then draw max_samples * X.shape[0] samples.

- If “auto”, then max_samples=min(256, n_samples).

- default='auto'

contamination: proportion of outliers in the dataset ('auto' or float)

- default='auto'

max_features: number of features (columns) drawn for each estimator (int or float), default=1.0

- If int, then draw max_features features.

- If float, then draw max(1, int(max_features * n_features_in_)) features.

- default=1.0

bootstrap: whether to sample the data with replacement (boolean),

- default=False

- Not recommended, because outliers may end up not being drawn
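A minimal usage sketch with the parameters listed above, all left at their defaults. The data is synthetic and made up for illustration (500 points around the origin plus three far-away points), and `random_state=0` is added only for reproducibility.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 500 normal points plus 3 obvious outliers (illustrative only).
rng = np.random.RandomState(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X = np.vstack([X_normal, X_outliers])

model = IsolationForest(
    n_estimators=100,      # number of trees
    max_samples="auto",    # min(256, n_samples) rows per tree
    contamination="auto",  # threshold derived from the 0.5 score boundary
    max_features=1.0,      # use all features
    bootstrap=False,       # sample without replacement
    random_state=0,
)
model.fit(X)

labels = model.predict(X)              # +1 = inlier, -1 = outlier
paper_scores = -model.score_samples(X) # score_samples is the negated score

print("sub-sample size per tree:", model.max_samples_)
print("labels of the 3 outliers:", labels[-3:])
print("scores of the 3 outliers:", paper_scores[-3:].round(3))
```

Note that `score_samples` returns the opposite of the score described earlier (lower means more abnormal), so it is negated here to get back a value where closer to 1 means more anomalous.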
