
Kaggle Competitions


Solutions to two Kaggle competitions, House Prices Prediction and Credit Default Risk Prediction:

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

https://www.kaggle.com/c/home-credit-default-risk

Both solutions use advanced decision-tree-based regression and classification models.

In House Prices Prediction, performance evaluation is based on RMSLE (Root Mean Squared Logarithmic Error), while in Credit Default Risk Prediction it is based on AUROC (Area Under the Receiver Operating Characteristic curve).
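For reference, both metrics can be computed with scikit-learn. The sketch below is illustrative only; the arrays are placeholders standing in for real predictions, not data from the competitions.

```python
# Minimal sketch of both competition metrics (placeholder data).
import numpy as np
from sklearn.metrics import mean_squared_log_error, roc_auc_score

# RMSLE for House Prices: square root of the mean squared log error.
y_true = np.array([200_000, 150_000, 320_000])
y_pred = np.array([195_000, 160_000, 300_000])
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

# AUROC for Credit Default Risk: ranks predicted default probabilities
# against the true default labels.
labels = np.array([0, 1, 0, 1])
probs = np.array([0.1, 0.8, 0.3, 0.6])
auroc = roc_auc_score(labels, probs)

print(f"RMSLE: {rmsle:.5f}, AUROC: {auroc:.5f}")
```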

In House Prices Prediction, I ranked 816/5011, with an RMSLE of 0.12549, compared to the best score of 0.00000.

(Screenshot: House Prices leaderboard result, 2022-01-24)

In Credit Default Risk Prediction, I scored an AUROC of 0.73610, compared to the best score of 0.81724; ranking was unavailable.

(Screenshot: Credit Default Risk submission score, 2022-01-26)

My submissions can be found in the submissions folder.

Problem Description

The problems are described in detail at the Kaggle links above.

Solution Approach

House Prices Prediction (Open In Colab)

After feature engineering, the following regression models are tested (a minimal setup sketch follows the list):

  • Ridge
  • BaggingRegressor
    • n_estimators=50
  • RandomForestRegressor
    • n_estimators=50
  • XGBRegressor
    • max_depth=5
    • objective='reg:squarederror'
  • LGBMRegressor
  • VotingRegressor
    • estimators=[ridge, bagging, random_forest, xgb, lgbm]
    • n_jobs=-1
  • StackingRegressor
    • estimators=[ridge, bagging, random_forest, xgb, lgbm]
    • final_estimator=Ridge
    • n_jobs=-1
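Below is a minimal sketch of how this ensemble might be assembled with scikit-learn, XGBoost, and LightGBM, using the hyperparameters listed above. The variable names are illustrative (the feature matrix is omitted); this is not the repository's exact code.

```python
# Sketch of the regression ensemble with the listed hyperparameters.
from sklearn.linear_model import Ridge
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              VotingRegressor, StackingRegressor)
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

ridge = Ridge()
bagging = BaggingRegressor(n_estimators=50)
random_forest = RandomForestRegressor(n_estimators=50)
xgb = XGBRegressor(max_depth=5, objective='reg:squarederror')
lgbm = LGBMRegressor()

estimators = [('ridge', ridge), ('bagging', bagging),
              ('random_forest', random_forest),
              ('xgb', xgb), ('lgbm', lgbm)]

# Voting averages the base predictions; stacking feeds them
# into a final Ridge meta-learner.
voting = VotingRegressor(estimators=estimators, n_jobs=-1)
stacking = StackingRegressor(estimators=estimators,
                             final_estimator=Ridge(), n_jobs=-1)
```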

Validation setup:

  • train_test_split(test_size=0.2, random_state=0)
  • kfold = KFold(n_splits=5, shuffle=True, random_state=0)
  • cross_val_score(cv=kfold)
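A minimal, self-contained sketch of this validation setup, with synthetic data standing in for the engineered House Prices features:

```python
# Sketch of the hold-out split and 5-fold cross-validation above.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# Synthetic placeholder data (the real features come from feature engineering).
X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Hold-out split for the validation scores.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

# cross_val_score reports R² for regressors by default.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X_train, y_train, cv=kfold)
print(f"Cross-validation R² mean: {scores.mean():.5f}")
```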

VotingRegressor performs best, with the best combination of validation R² score, RMSLE, and mean cross-validation R² score.

Credit Default Risk Prediction (Open In Colab)

After feature engineering, the following classification models are tested (a minimal setup sketch follows the list):

  • XGBClassifier
    • tree_method='gpu_hist'
    • gpu_id=0
  • LGBMClassifier
    • device='gpu'
  • RandomForestClassifier
    • n_estimators=50
  • StackingClassifier
    • estimators=[xgb, lgbm, random_forest]
    • final_estimator=LGBMClassifier
    • n_jobs=-1
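A minimal sketch of this classification ensemble with the listed GPU settings. Note that `tree_method='gpu_hist'` and `gpu_id` require a CUDA-enabled build of XGBoost (and are the pre-2.0 argument names), and `device='gpu'` requires a GPU build of LightGBM; drop the GPU arguments to run on CPU.

```python
# Sketch of the classification ensemble with the listed hyperparameters.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

xgb = XGBClassifier(tree_method='gpu_hist', gpu_id=0)
lgbm = LGBMClassifier(device='gpu')
random_forest = RandomForestClassifier(n_estimators=50)

# Stacking feeds the base predictions into an LGBMClassifier meta-learner.
stacking = StackingClassifier(
    estimators=[('xgb', xgb), ('lgbm', lgbm),
                ('random_forest', random_forest)],
    final_estimator=LGBMClassifier(),
    n_jobs=-1)
```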

Validation setup: train_test_split(test_size=0.2, random_state=42)

GPU acceleration is leveraged here, as the classification task is more computationally demanding.

LGBMClassifier performs best, with the highest validation AUROC score.