LightGBM

by 30303 2024. 3. 22. 19:28

Motivation for LightGBM

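- Traditionally, GBM-family algorithms scan every feature over every data instance to compute the Information Gain of each candidate split

- LightGBM reduces both the number of features and the number of data instances that are used

- It applies Gradient-based One-Side Sampling (GOSS)

- When Information Gain is computed, each data instance has a different gradient, which serves as a measure of its importance

- Instances with large gradients are therefore kept, and instances with small gradients are randomly dropped

- Exclusive Feature Bundling (EFB)

- In sparse data, features rarely take nonzero values at the same time (e.g. one-hot encoded columns)

- Such mutually exclusive features are therefore bundled into a single feature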
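The bundling idea can be illustrated with a minimal sketch. It assumes strictly exclusive, non-negative integer columns; `bundle_exclusive` is a hypothetical helper written for this post, not a LightGBM function. LightGBM itself bundles histogram bins and tolerates a small rate of conflicts.

```python
def bundle_exclusive(columns):
    """Merge mutually exclusive non-negative integer columns into one column.

    Each column is shifted by an offset so that its values occupy their own
    range, which keeps the original feature identifiable after bundling.
    """
    n_rows = len(columns[0])
    bundled = [0] * n_rows
    offset = 0
    for col in columns:
        for i, value in enumerate(col):
            if value != 0:
                if bundled[i] != 0:
                    raise ValueError(f"row {i}: features are not exclusive")
                bundled[i] = value + offset
        offset += max(col)  # shift the next feature past this one's range
    return bundled


# Three one-hot columns: at most one of them is nonzero in any row.
color_red = [1, 0, 0, 0]
color_green = [0, 1, 0, 0]
color_blue = [0, 0, 1, 0]

bundle = bundle_exclusive([color_red, color_green, color_blue])
print(bundle)  # [1, 2, 3, 0]
```

Three columns collapse into one, so the split search only has to scan a single feature instead of three.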

Finding split points

Gradient-based One-Side Sampling (GOSS)

- Compute the gradient of each data instance and sort the instances by its absolute value

- Keep the instances with large gradients and randomly drop instances with small gradients

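- Keep the top a×100% of instances by gradient magnitude, randomly sample b×100% of all instances from the rest, and multiply the sampled instances' gradients by (1-a)/b to compensate for the dropped data. Because b ≤ 1-a, this factor is never below 1: case 1 (a=0.1, b=0.9) gives 1.0, so nothing is dropped, while case 2 (a=0.05, b=0.5) gives 1.9 and trains on 55% of the data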
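The sampling step above can be sketched in plain Python. This is a simplified illustration, not LightGBM's implementation: `goss_sample` and its `seed` argument are hypothetical, and `a` and `b` are the two rates from the notes above.

```python
import random


def goss_sample(gradients, a, b, seed=0):
    """Return (indices, weights) chosen by a simplified GOSS step.

    a: fraction of instances kept for having the largest |gradient|
    b: fraction of all instances randomly sampled from the remainder
    """
    n = len(gradients)
    top_n = int(a * n)
    rand_n = int(b * n)

    # Sort instance indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top = order[:top_n]
    rest = order[top_n:]

    # Randomly keep rand_n of the small-gradient instances.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(rand_n, len(rest)))

    # Up-weight the sampled instances to compensate for the dropped ones.
    factor = (1 - a) / b
    weights = {i: 1.0 for i in top}
    weights.update({i: factor for i in sampled})
    return top + sampled, weights


gradients = [0.9, -0.05, 0.02, -0.8, 0.1, 0.03, -0.01, 0.04, 0.6, -0.02]
indices, weights = goss_sample(gradients, a=0.2, b=0.4)
print(sorted(indices[:2]))  # the two largest-|gradient| instances
print(len(indices))         # 2 kept + 4 sampled
print(weights[indices[-1]]) # weight of a sampled small-gradient instance
```

With 10 instances, a=0.2 and b=0.4, the tree is grown on 6 instances, and each sampled small-gradient instance counts roughly twice.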

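LightGBM handles missing values inside the model itself, so rows with missing values do not have to be dropped

- It trains quickly on big data

- A commonly cited guideline is to use it when there are about 10,000 or more data points, since it tends to overfit on small datasets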

[LightGBM Parameters]

- Package : https://lightgbm.readthedocs.io/en/latest/Python-Intro.html

- learning_rate : the same as shrinkage in GBM; it scales down the contribution of each new tree

- reg_lambda : L2 regularization term on weights (analogous to Ridge regression)

- reg_alpha : L1 regularization term on weights (analogous to Lasso regression)

- objective

default = regression; type = enum; options: regression, regression_l1, huber, fair, poisson, quantile, mape, gamma, tweedie, binary, multiclass, multiclassova, cross_entropy, cross_entropy_lambda, lambdarank, rank_xendcg; aliases: objective_type, app, application, loss

- eval_metric [ default according to objective ]

- The metric to be used for validation data.

- In the scikit-learn API the defaults are l2 for LGBMRegressor and logloss for LGBMClassifier.

- Typical values are:

- rmse – root mean square error

- mae – mean absolute error

- logloss – negative log-likelihood (binary_logloss / multi_logloss in the native API)

- error – Binary classification error rate (0.5 threshold); native name binary_error

- multi_error – Multiclass classification error rate

- multi_logloss – Multiclass logloss

- auc: Area under the curve
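To make a few of the metrics above concrete, here is a minimal plain-Python sketch of how they are computed. The function names are chosen for illustration and are not LightGBM calls; the sample values are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_logloss(y_true, p_pred, eps=1e-15):
    """Negative log-likelihood for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def binary_error(y_true, p_pred, threshold=0.5):
    """Share of instances misclassified at the given probability threshold."""
    wrong = sum(int(p > threshold) != t for t, p in zip(y_true, p_pred))
    return wrong / len(y_true)


# Made-up regression targets and predictions.
reg_rmse = rmse([3.0, 5.0], [2.0, 7.0])
reg_mae = mae([3.0, 5.0], [2.0, 7.0])

# Made-up binary labels and predicted probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
clf_error = binary_error(labels, probs)
clf_logloss = binary_logloss(labels, probs)

print(reg_rmse, reg_mae, clf_error, clf_logloss)
```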

- Hyperparameter tuning

- n_estimators, learning_rate, max_depth, reg_alpha

- LightGBM is one of the algorithms with a very large number of hyperparameters

- Tuning just the four above carefully is often enough to get good results
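As a rough starting point, a search over the four parameters listed above could be laid out like this. The value ranges and the fixed `reg_lambda` are assumptions chosen for illustration, not recommendations from the LightGBM documentation.

```python
from itertools import product

# Fixed settings; the names follow LightGBM's scikit-learn API.
base_params = {
    "objective": "regression",
    "reg_lambda": 1.0,  # L2 penalty (assumed value)
}

# Hypothetical search ranges for the four parameters named above.
search_space = {
    "n_estimators": [200, 500],
    "learning_rate": [0.05, 0.1],
    "max_depth": [4, 8],
    "reg_alpha": [0.0, 0.1],
}

# Build every combination as a ready-to-use parameter dict.
names = list(search_space)
candidates = [
    {**base_params, **dict(zip(names, values))}
    for values in product(*search_space.values())
]

print(len(candidates))  # 2 * 2 * 2 * 2 = 16
print(candidates[0])
```

Each dict can be passed as keyword arguments, e.g. `lightgbm.LGBMRegressor(**candidate)`, and scored on validation data to pick the best combination.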
