Machine Learning Master Code

Feature Engineering

  • Normalization options: log1p, z-score, min-max (min-max mainly for MLPs); see the preprocessing sketch after this list
  • Missing-value handling options: fill with 0, fill with -9999, or keep as NA
  • Weight of Evidence (WoE) features when a variable has too many splits
  • Target mean encoding (a leakage-safe out-of-fold version is sketched below)
  • Two- or three-way interaction features
  • Clustering features computed on the original dataset
  • Number of non-zero elements in each row
  • Post-processing
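
A minimal sketch of the normalization and missing-value options above, assuming a pandas DataFrame `df` and a list of numeric column names `num_cols` (both hypothetical):

```python
import numpy as np
import pandas as pd

def preprocess(df, num_cols, method="zscore", fill=-9999):
    """Apply one normalization option and one missing-value fill option.

    method: "log1p", "zscore", or "minmax" (min-max mainly for MLP inputs).
    fill:   0, -9999, or None to keep NA (tree models handle NaN natively).
    """
    out = df.copy()
    for c in num_cols:
        x = out[c].astype(float)
        if method == "log1p":        # compresses heavy right tails
            x = np.log1p(x.clip(lower=0))
        elif method == "zscore":     # zero mean, unit variance
            x = (x - x.mean()) / (x.std() + 1e-12)
        elif method == "minmax":     # scale to [0, 1], useful for MLPs
            x = (x - x.min()) / (x.max() - x.min() + 1e-12)
        out[c] = x
    if fill is not None:             # None keeps NA for LightGBM/XGBoost
        out[num_cols] = out[num_cols].fillna(fill)
    return out
```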

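And a sketch of the derived features (out-of-fold target mean encoding, a 2-way interaction, cluster membership, per-row non-zero counts), assuming pandas DataFrames `train` and `test`; all column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def oof_target_mean_encode(train, test, col, target, n_splits=5, seed=0):
    """Leakage-safe target mean encoding: each train row is encoded with
    category means computed on the other folds only."""
    prior = train[target].mean()
    enc = pd.Series(np.nan, index=train.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, val_idx in kf.split(train):
        fold_means = train.iloc[tr_idx].groupby(col)[target].mean()
        enc.iloc[val_idx] = train[col].iloc[val_idx].map(fold_means).to_numpy()
    train[col + "_tme"] = enc.fillna(prior)
    # Test rows are encoded with means from the full training set
    test[col + "_tme"] = test[col].map(train.groupby(col)[target].mean()).fillna(prior)

def add_row_features(df, num_cols, n_clusters=10, seed=0):
    """Cluster-membership feature, per-row non-zero count, one 2-way interaction."""
    X = df[num_cols].fillna(0).to_numpy()
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    df["cluster_id"] = km.fit_predict(X)      # clustering feature on original data
    df["n_nonzero"] = (X != 0).sum(axis=1)    # number of non-zero elements per row
    a, b = num_cols[0], num_cols[1]           # hypothetical column pair
    df[a + "_x_" + b] = df[a] * df[b]         # 2-way interaction feature
    return df
```
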
Feature Selection

  • Reference: https://www.kaggle.com/ogrellier/feature-scoring-vs-zeros
  • For each LGBM model, take the top 50 features by importance and randomly choose another 50 from the remaining features (see the sketch below)
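
A sketch of that selection rule; the 50/50 split comes from the bullet above, while `model` and everything else is illustrative and assumes an already-trained LightGBM booster:

```python
import numpy as np

def select_features(feature_names, importances, n_top=50, n_rand=50, seed=0):
    """Top-N features by importance plus N random features from the rest,
    so each model in the ensemble sees a slightly different feature set."""
    order = np.argsort(importances)[::-1]
    top = [feature_names[i] for i in order[:n_top]]
    rest = [feature_names[i] for i in order[n_top:]]
    rng = np.random.default_rng(seed)
    extra = list(rng.choice(rest, size=min(n_rand, len(rest)), replace=False))
    return top + extra

# Usage with a trained LightGBM booster (hypothetical `model`):
# feats = select_features(model.feature_name(), model.feature_importance("gain"))
```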

Feature Extraction

  • ..

Modeling

  • LightGBM, CatBoost, XGBoost, Extra Trees
  • Tuning: Cartesian grid search + random grid search
  • Ensembling: 2-layer or 3-layer stacking
  • Clip predictions to the target range observed in the training set (see the sketch after this list)
  • LightGBM tuning guideline: https://github.com/Microsoft/LightGBM/issues/695
  • Tuning options guideline: https://sites.google.com/view/lauraepp/parameters
  • Three LGBM models with the same parameters but different seeds inside LGBM (update: and different seeds in the KFold splitting, which mattered even more; see the sketch below)
  • Evaluation metric: binary_logloss
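
A sketch tying together the seed-averaging and prediction cut-off bullets, assuming numpy arrays `X`, `y`, `X_test` and a hypothetical `params` dict; for a binary target scored with binary_logloss, swap in `LGBMClassifier` and `predict_proba`:

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

def seed_averaged_predict(X, y, X_test, params, seeds=(0, 1, 2), n_splits=5):
    """Average LGBM runs that differ only in the model seed AND the
    KFold-splitting seed (the latter mattered even more, per the notes)."""
    preds = np.zeros(len(X_test))
    for seed in seeds:
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for tr_idx, val_idx in kf.split(X):
            model = lgb.LGBMRegressor(**params, random_state=seed)
            model.fit(X[tr_idx], y[tr_idx],
                      eval_set=[(X[val_idx], y[val_idx])],
                      callbacks=[lgb.early_stopping(100, verbose=False)])
            preds += model.predict(X_test) / (len(seeds) * n_splits)
    # Post-process: clip predictions to the target range seen in training
    return np.clip(preds, y.min(), y.max())
```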