Machine Learning Master Code

Feature Engineering

  • Normalization options: log1p, z-score, min-max (min-max mainly for MLPs); see the preprocessing sketch after this list
  • Missing-value handling options: fill with 0, fill with -9999, or keep as NA
  • Weight of Evidence (WoE) features when a variable has too many splits
  • Target mean encoding (a leakage-safe out-of-fold version is sketched below)
  • Two- or three-way interaction features
  • Clustering features computed on the original dataset
  • Number of non-zero elements in each row
  • Post-processing
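
A minimal sketch of the normalization and missing-value options above, assuming a pandas DataFrame `df` and a list of numeric column names `num_cols` (both hypothetical):

```python
import numpy as np
import pandas as pd

def preprocess(df, num_cols, method="zscore", fill=-9999):
    """Apply one normalization option and one missing-value fill option.

    method: "log1p", "zscore", or "minmax" (min-max mainly for MLP inputs).
    fill:   0, -9999, or None to keep NA (tree models handle NaN natively).
    """
    out = df.copy()
    for c in num_cols:
        x = out[c].astype(float)
        if method == "log1p":        # compresses heavy right tails
            x = np.log1p(x.clip(lower=0))
        elif method == "zscore":     # zero mean, unit variance
            x = (x - x.mean()) / (x.std() + 1e-12)
        elif method == "minmax":     # scale to [0, 1], useful for MLPs
            x = (x - x.min()) / (x.max() - x.min() + 1e-12)
        out[c] = x
    if fill is not None:             # None keeps NA for LightGBM/XGBoost
        out[num_cols] = out[num_cols].fillna(fill)
    return out
```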

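And a sketch of the derived features (out-of-fold target mean encoding, a 2-way interaction, cluster membership, per-row non-zero counts), assuming pandas DataFrames `train` and `test`; all column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def oof_target_mean_encode(train, test, col, target, n_splits=5, seed=0):
    """Leakage-safe target mean encoding: each train row is encoded with
    category means computed on the other folds only."""
    prior = train[target].mean()
    enc = pd.Series(np.nan, index=train.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, val_idx in kf.split(train):
        fold_means = train.iloc[tr_idx].groupby(col)[target].mean()
        enc.iloc[val_idx] = train[col].iloc[val_idx].map(fold_means).to_numpy()
    train[col + "_tme"] = enc.fillna(prior)
    # Test rows are encoded with means from the full training set
    test[col + "_tme"] = test[col].map(train.groupby(col)[target].mean()).fillna(prior)

def add_row_features(df, num_cols, n_clusters=10, seed=0):
    """Cluster-membership feature, per-row non-zero count, one 2-way interaction."""
    X = df[num_cols].fillna(0).to_numpy()
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    df["cluster_id"] = km.fit_predict(X)      # clustering feature on original data
    df["n_nonzero"] = (X != 0).sum(axis=1)    # number of non-zero elements per row
    a, b = num_cols[0], num_cols[1]           # hypothetical column pair
    df[a + "_x_" + b] = df[a] * df[b]         # 2-way interaction feature
    return df
```
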
Feature Selection

  • Reference: https://www.kaggle.com/ogrellier/feature-scoring-vs-zeros
  • For each LGBM model, take the top 50 features by importance and randomly choose another 50 from the remaining features (see the sketch below)
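
A sketch of that selection rule; the 50/50 split comes from the bullet above, while `model` and everything else is illustrative and assumes an already-trained LightGBM booster:

```python
import numpy as np

def select_features(feature_names, importances, n_top=50, n_rand=50, seed=0):
    """Top-N features by importance plus N random features from the rest,
    so each model in the ensemble sees a slightly different feature set."""
    order = np.argsort(importances)[::-1]
    top = [feature_names[i] for i in order[:n_top]]
    rest = [feature_names[i] for i in order[n_top:]]
    rng = np.random.default_rng(seed)
    extra = list(rng.choice(rest, size=min(n_rand, len(rest)), replace=False))
    return top + extra

# Usage with a trained LightGBM booster (hypothetical `model`):
# feats = select_features(model.feature_name(), model.feature_importance("gain"))
```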

Feature Extraction

  • ..

Modeling

  • LightGBM, CatBoost, XGBoost, Extra Trees
  • Tuning: Cartesian grid search + random grid search
  • Ensembling: 2-layer or 3-layer stacking
  • Clip predictions to the target range observed in the training set (see the sketch after this list)
  • LightGBM tuning guideline: https://github.com/Microsoft/LightGBM/issues/695
  • Tuning options guideline: https://sites.google.com/view/lauraepp/parameters
  • Three LGBM models with the same parameters but different seeds inside LGBM (update: and different seeds in the KFold splitting, which mattered even more; see the sketch below)
  • Evaluation metric: binary_logloss
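
A sketch tying together the seed-averaging and prediction cut-off bullets, assuming numpy arrays `X`, `y`, `X_test` and a hypothetical `params` dict; for a binary target scored with binary_logloss, swap in `LGBMClassifier` and `predict_proba`:

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import KFold

def seed_averaged_predict(X, y, X_test, params, seeds=(0, 1, 2), n_splits=5):
    """Average LGBM runs that differ only in the model seed AND the
    KFold-splitting seed (the latter mattered even more, per the notes)."""
    preds = np.zeros(len(X_test))
    for seed in seeds:
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for tr_idx, val_idx in kf.split(X):
            model = lgb.LGBMRegressor(**params, random_state=seed)
            model.fit(X[tr_idx], y[tr_idx],
                      eval_set=[(X[val_idx], y[val_idx])],
                      callbacks=[lgb.early_stopping(100, verbose=False)])
            preds += model.predict(X_test) / (len(seeds) * n_splits)
    # Post-process: clip predictions to the target range seen in training
    return np.clip(preds, y.min(), y.max())
```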