This is a collection of solutions for Kaggle competitions and a summary of useful skills I have learned from those solutions. The contents are divided into two sections: the first is my summary of the winning solutions, and the second is the collection of those solutions along with some useful tutorials.
- Computing environment setup
- Exploratory data analysis
- A quick benchmark run
- Data preprocessing
- Feature engineering
- Feature selection
- Model evaluation and selection
- Parameter tuning
- Model ensembling
- Prediction and submission
- Use Google Cloud or Amazon AWS as the computing platform
- Calculate summary statistics:
  - total number of samples and variables
  - number of missing values and zeros
  - mean, sd, min, max values for continuous variables
  - number of unique values/categories for categorical and ordinal variables
- Plot
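The summary statistics above can be computed in a few lines with pandas. A minimal sketch on a hypothetical toy DataFrame (the column names are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical toy dataset standing in for competition data
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 0],
    "income": [50000, 0, 62000, np.nan, 48000],
    "city": ["NY", "SF", "NY", "LA", "SF"],
})

n_samples, n_vars = df.shape                        # total samples and variables
n_missing = df.isna().sum()                         # missing values per column
n_zeros = (df.select_dtypes("number") == 0).sum()   # zeros in numeric columns
cont_stats = df.describe().loc[["mean", "std", "min", "max"]]  # continuous vars
n_unique = df["city"].nunique()                     # categories in a categorical var
```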
- Use Random Forest (100 trees) without any feature engineering to generate a quick submission. This submission can be used as a benchmark for further improvement. Plot the feature importances to get a sense of which features matter most for prediction.
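A quick benchmark of this kind might look as follows; this is a sketch with synthetic data in place of real competition features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the preprocessed competition features
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 100-tree Random Forest with default settings as a quick benchmark
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Rank features by importance (most important first)
ranking = np.argsort(rf.feature_importances_)[::-1]
```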
- Train a simple Random Forest model and plot the confusion matrix for classification, or a true-vs-predicted scatter plot for regression. Find out where most of the prediction errors come from; for example, they may be concentrated in certain categories. This requires splitting the original training data into training and testing sets.
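The split-and-inspect step can be sketched like this, again with synthetic data standing in for a real training set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, random_state=1)

# Hold out part of the training data so errors can be studied
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, rf.predict(X_te))
# Off-diagonal cells show which classes are confused most often
```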
- General transformation: multiply, divide, sum, subtract, log, min, max, mean, std
- If the data contain distance or length variables, several new features can be generated by multiplying (area or volume), dividing (ratio between two lengths), subtracting (difference between two lengths), or summing (total length or distance) those variables or a subset of them.
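These transformations are one-liners in pandas; the column names below are hypothetical:

```python
import pandas as pd

# Hypothetical length/distance columns
df = pd.DataFrame({"length": [2.0, 3.0], "width": [4.0, 5.0]})

df["area"] = df["length"] * df["width"]    # multiply -> area
df["ratio"] = df["length"] / df["width"]   # divide -> ratio of two lengths
df["diff"] = df["length"] - df["width"]    # subtract -> difference
df["total"] = df["length"] + df["width"]   # sum -> total length
```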
- Date variables: (1) Extract day, month, quarter, year, weekend, weekday, holiday, etc. as new features (2) Calculate the length of time between two dates
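Both kinds of date features are available through pandas' `.dt` accessor; a sketch with made-up dates:

```python
import pandas as pd

df = pd.DataFrame({"start": pd.to_datetime(["2020-01-03", "2020-06-15"]),
                   "end":   pd.to_datetime(["2020-01-10", "2020-07-01"])})

# (1) calendar features extracted from a single date column
df["month"] = df["start"].dt.month
df["quarter"] = df["start"].dt.quarter
df["year"] = df["start"].dt.year
df["is_weekend"] = df["start"].dt.dayofweek >= 5   # Saturday=5, Sunday=6

# (2) elapsed time between two dates
df["days_between"] = (df["end"] - df["start"]).dt.days
```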
- Use the feature importance generated by Random Forest or XGBoost to rank features. Iteratively remove the least important features and refit the model until prediction accuracy starts to decrease.
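The iterative elimination loop might be sketched as below (a simplified version with synthetic data; a real run would track the last feature set that still scored well):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data; only a few features are actually informative
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)

kept = list(range(X.shape[1]))   # column indices still in play
best = 0.0
while len(kept) > 1:
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    score = cross_val_score(rf, X[:, kept], y, cv=3).mean()
    if score < best:             # accuracy started to decrease: stop
        break
    best = score
    rf.fit(X[:, kept], y)
    kept.pop(int(np.argmin(rf.feature_importances_)))  # drop least important
```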
- Use XGBoost for feature selection: (1) keep the number of trees small (<20); (2) keep the max depth of the trees small (<7); (3) iteratively run the feature importance analysis, removing the most (or least) important features.
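A small, shallow boosted model is cheap enough to refit many times during selection. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (XGBoost's sklearn wrapper exposes an analogous interface), with synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Small number of shallow trees, per the guidelines above (<20 trees, depth <7)
gbm = GradientBoostingClassifier(n_estimators=15, max_depth=3, random_state=0)
gbm.fit(X, y)
importance = gbm.feature_importances_  # basis for the next removal round
```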
- Decision-tree models (XGBoost, Random Forest) are not affected by multicollinearity.
- Popular models: Random Forest, Extra Trees, XGBoost
- Use grid search to fine tune parameters.
- For Random Forest and Extra Trees, two important parameters to tune are the number of trees and the number of randomly selected features considered at each split.
- XGBoost: (1) small eta -> small shrinkage -> less overfitting -> slower convergence -> needs more trees