XGBoost

by 30303 2024. 3. 22. 18:32

XGBoost(eXtreme Gradient Boosting)

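What is XGBoost?

XGBoost is short for eXtreme Gradient Boosting.

The Gradient Boosting Machine (GBM) is the representative algorithm built on the boosting technique.

- XGBoost implements this algorithm so that training can run in parallel

- It supports both regression and classification problems, with advantages in performance and resource efficiency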

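Advantages of XGBoost

- Faster run time than GBM

- Parallel processing makes both training and prediction fast

- Includes a penalty term that guards against overfitting

- Shows strong learning performance on supervised learning tasks

- Handles missing values internally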

An optimized version of GBM enabling

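- Cache awareness and out-of-core computing : hardware-level optimizations

- XGBoost's approximate split finding searches for a near-optimal value, so a small loss in accuracy is possible

- In return, it can process large amounts of data quickly, which makes it very useful for big data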

A Scalable Tree Boosting System

- Exact Greedy Algorithm for Split Finding

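- The standard way to find a tree split point (the point with the highest purity, i.e. information gain)

- Advantage: always finds the optimal split point, because every possible split point is evaluated

- Disadvantage: the larger the data, the harder it is to evaluate every possible split point (infeasible once the data no longer fits in memory), and it is not practical in a distributed setting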

XGBoost Split Finding Algorithm : Approximate Algorithm

- Sort the values of the k-th feature in ascending order (left: min, right: max)

- Divide the dataset into buckets (the example uses 10 buckets)

- Find a split point within each bucket, then choose the point with the highest IG (information gain) across the 10 buckets as the best split point
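The difference between the exact greedy search and the bucketed search can be sketched in plain Python. This is an illustrative simplification, not XGBoost's implementation: the helper names are hypothetical, the gain is a plain squared-error reduction rather than XGBoost's gradient-based gain, and only the boundaries between equal-sized buckets are evaluated as candidates.

```python
import random

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_gain(xs, ys, threshold):
    """Reduction in squared error when splitting at x <= threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    if not left or not right:
        return 0.0
    return sse(ys) - sse(left) - sse(right)

def best_split(xs, ys, candidates):
    """Return (threshold, gain) for the candidate with the highest gain."""
    scored = ((t, split_gain(xs, ys, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])

def exact_candidates(xs):
    """Exact greedy: every distinct value except the maximum is a candidate."""
    return sorted(set(xs))[:-1]

def bucket_candidates(xs, n_buckets=10):
    """Approximate: only the last value of each equal-sized bucket is a candidate."""
    ordered = sorted(xs)
    step = len(ordered) / n_buckets
    return sorted({ordered[int(i * step) - 1] for i in range(1, n_buckets)})

# Synthetic data (an assumption for illustration): the target jumps at x = 0.37.
random.seed(0)
xs = [random.random() for _ in range(500)]
ys = [(2.0 if x > 0.37 else 0.0) + random.gauss(0, 0.1) for x in xs]

exact_t, exact_gain = best_split(xs, ys, exact_candidates(xs))
approx_t, approx_gain = best_split(xs, ys, bucket_candidates(xs))
print(f"exact:  threshold={exact_t:.3f} gain={exact_gain:.1f}")
print(f"approx: threshold={approx_t:.3f} gain={approx_gain:.1f}")
```

The bucketed search evaluates 9 candidates instead of 499, and its best gain can never exceed the exact one because its candidates are a subset of the exact candidates.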

Sparsity-Aware Split Finding

- Efficient handling of missing data

- Real-world production data contains many missing values

- Even when values are not missing, 0 (zero) values are very common

- One-hot encoding of categorical features also produces many 0 values

- Solution : define a default direction for each split during training

- When a new missing or zero value arrives, it is sent in the default direction defined during training
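The idea of a default direction can be sketched as follows. This is a simplified illustration under stated assumptions: missing values are represented as `None`, the gain is a squared-error reduction instead of XGBoost's gradient statistics, and `choose_default_direction` and `route` are hypothetical helper names.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def choose_default_direction(xs, ys, threshold):
    """Try sending all missing rows left, then right; keep the better direction."""
    left = [y for x, y in zip(xs, ys) if x is not None and x <= threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x > threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]
    gain_left = sse(ys) - sse(left + missing) - sse(right)
    gain_right = sse(ys) - sse(left) - sse(right + missing)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right

def route(x, threshold, default_direction):
    """At prediction time, missing values follow the learned default direction."""
    if x is None:
        return default_direction
    return "left" if x <= threshold else "right"

# Toy data (an assumption): rows with a missing feature have targets near 2.0,
# like the rows on the right side of the split.
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [0.1, 0.0, 2.1, 2.0, 1.9, 2.0]

direction, gain = choose_default_direction(xs, ys, threshold=5.0)
print(direction, route(None, 5.0, direction), route(1.0, 5.0, direction))
```

Because the direction is chosen from the training data, missing rows end up on the side whose targets they most resemble, and no imputation step is needed.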

Parallelized Tree Building

- When training a tree, the values of each feature are sorted to find split points

- Sorting is very time-consuming (and takes longer as the data grows)

- So the data is stored column-wise rather than row-wise, with each column sorted up front (compressed column, CSC format)

- This changes the order of the data, but the original row index of each value is kept from the initial sort

- After this one-time sort, no further sorting is needed → less training time
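A minimal sketch of the one-time sort, assuming a small dense list-of-rows dataset (XGBoost itself uses compressed column blocks; the function names here are hypothetical). Each column is sorted once, the original row indices are stored, and any tree node can then read its values in sorted order without sorting again.

```python
def presort_columns(rows):
    """Sort each feature column once; keep the original row indices in sorted order."""
    n_features = len(rows[0])
    return [
        sorted(range(len(rows)), key=lambda i, j=j: rows[i][j])
        for j in range(n_features)
    ]

def sorted_values_in_node(rows, sorted_index, feature, node_rows):
    """Ascending values of one feature for the rows in a node, without re-sorting."""
    return [rows[i][feature] for i in sorted_index[feature] if i in node_rows]

rows = [
    [3.0, 10.0],
    [1.0, 30.0],
    [2.0, 20.0],
    [0.5, 40.0],
]
sorted_index = presort_columns(rows)  # computed once, before any tree is built

# A node that contains only rows 0, 2 and 3 reuses the stored order.
node_rows = {0, 2, 3}
print(sorted_values_in_node(rows, sorted_index, 0, node_rows))
print(sorted_values_in_node(rows, sorted_index, 1, node_rows))
```

Since every column has its own index list, the scan over each feature is independent of the others, which is what allows split finding to run in parallel across features.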

Because XGBoost handles missing values inside the model itself, rows with missing values do not have to be dropped

[XGBoost Parameters]

- Package : https://xgboost.readthedocs.io/en/stable/

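- booster : selects the type of model run at each iteration; the two main options are:

  - gbtree : tree-based models

  - gblinear : linear models

- silent : controls whether running messages are printed during training (the output gets in the way during parameter experiments)

  - 0 prints the messages, 1 suppresses them

  - newer releases replace silent with verbosity (0 = silent, 1 = warning, 2 = info, 3 = debug)

- nthread : number of cores to use for parallel processing

  - by default, all available cores are used

- learning_rate : the same as shrinkage in GBM; scales down the contribution of each new tree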

reg_lambda : L2 regularization term on weights (analogous to Ridge regression)

reg_alpha : L1 regularization term on weights (analogous to Lasso regression)
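How these three parameters act on a single leaf can be sketched with the regularized leaf-weight formula w = -G / (H + lambda), where G and H are the sums of first and second derivatives of the loss over the rows in the leaf. The sketch below is an illustration rather than XGBoost's source: the function names are hypothetical, and the L1 term is applied as soft-thresholding of G, which is how I understand XGBoost to treat reg_alpha.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Regularized leaf weight.

    reg_alpha (L1) soft-thresholds the gradient sum toward zero;
    reg_lambda (L2) enlarges the denominator, shrinking the weight.
    """
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0
    return -shrunk / (hess_sum + reg_lambda)

def boosted_update(prediction, weight, learning_rate=0.3):
    """learning_rate scales how much of the new leaf weight is added."""
    return prediction + learning_rate * weight

G, H = -4.0, 3.0  # example gradient statistics for one leaf (made up)
print(leaf_weight(G, H, reg_lambda=0.0))                 # no regularization
print(leaf_weight(G, H, reg_lambda=1.0))                 # L2 only
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=1.0))  # L1 and L2
print(boosted_update(0.5, leaf_weight(G, H), learning_rate=0.1))
```

Larger reg_lambda or reg_alpha values pull leaf weights toward zero, and a reg_alpha at least as large as |G| zeroes the leaf entirely.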

objective [default=reg:squarederror; named reg:linear in older releases]

- This defines the loss function to be minimized. Mostly used values are:

- binary:logistic – logistic regression for binary classification, returns predicted probability (not class)

- multi:softmax – multiclass classification using the softmax objective, returns predicted class (not probabilities)

  - You also need to set an additional num_class (number of classes) parameter defining the number of unique classes

- multi:softprob – same as softmax, but returns predicted probability of each data point belonging to each class.
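The difference in what these objectives return can be illustrated without the library. The sketch below only mimics the output shapes from raw scores; the function names are hypothetical and nothing here calls XGBoost.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_binary_logistic(score):
    """Like binary:logistic: a probability for the positive class."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_softprob(scores):
    """Like multi:softprob: one probability per class."""
    return softmax(scores)

def predict_softmax(scores):
    """Like multi:softmax: only the index of the most probable class."""
    probs = softmax(scores)
    return probs.index(max(probs))

raw_scores = [0.2, 1.5, -0.3]  # made-up scores for num_class = 3
print(predict_softprob(raw_scores))
print(predict_softmax(raw_scores))
print(predict_binary_logistic(0.0))
```

If class probabilities are needed later (for thresholds or calibration), multi:softprob is the safer choice, since the class label can always be recovered from the probabilities but not the other way around.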

eval_metric [ default according to objective ]

- The metric to be used for validation data.

- The default values are rmse for regression and logloss for classification (error in releases before 1.3).

Typical values are:

- rmse – root mean square error

- mae – mean absolute error

- logloss – negative log-likelihood

- error – Binary classification error rate (0.5 threshold)

- merror – Multiclass classification error rate

- mlogloss – Multiclass logloss

- auc – area under the ROC curve
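To make the definitions concrete, a few of these metrics can be computed by hand. This is a plain-Python sketch of the standard formulas with made-up numbers, not XGBoost's own evaluation code; the function names are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def logloss(y_true, p_pred, eps=1e-15):
    """Negative log-likelihood for binary labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y_true)

def error_rate(y_true, p_pred, threshold=0.5):
    """Share of wrong predictions when probabilities above the threshold count as 1."""
    wrong = sum(1 for t, p in zip(y_true, p_pred) if (p > threshold) != (t == 1))
    return wrong / len(y_true)

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(logloss([1, 0], [0.9, 0.2]))
print(error_rate([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
```

Note that error only looks at which side of the threshold a probability falls on, while logloss also penalizes confident wrong probabilities, so the two can rank models differently.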

Hyperparameter tuning

- n_estimators, learning_rate, max_depth, reg_alpha

- XGBoost is one of the algorithms with a very large number of hyperparameters

- Tuning just the four above well is often enough to get good results
