This is a collection of solutions for Kaggle competitions and a summary of useful skills I have learned from those solutions. The contents are divided into two sections: the first is my summary of the winning solutions, and the second is the collection of those solutions along with some useful tutorials.
- Computing environment setup
- Exploratory data analysis
- A quick benchmark run
- Data preprocessing
- Feature engineering
- Feature selection
- Model evaluation and selection
- Parameter tuning
- Model ensembling
- Prediction and submission
- Use Google Cloud or Amazon AWS as the computing platform
- Calculate summary statistics:
  - total number of samples and variables
  - number of missing values and zeros
  - mean, sd, min, max values for continuous variables
  - number of unique values/categories for categorical and ordinal variables
- Plot
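The summary statistics above can be computed in a few lines with pandas. A minimal sketch on a hypothetical toy DataFrame (the column names are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical toy dataset standing in for competition data
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 0],
    "income": [50000, 0, 62000, np.nan, 48000],
    "city": ["NY", "SF", "NY", "LA", "SF"],
})

n_samples, n_vars = df.shape                        # total samples and variables
n_missing = df.isna().sum()                         # missing values per column
n_zeros = (df.select_dtypes("number") == 0).sum()   # zeros in numeric columns
cont_stats = df.describe().loc[["mean", "std", "min", "max"]]  # continuous vars
n_unique = df["city"].nunique()                     # categories in a categorical var
```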
- Use Random Forest (100 trees) without any feature engineering to generate a quick submission. This submission can be used as a benchmark for further improvement. Plot the feature importances to get a sense of which features matter most for prediction.
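A quick benchmark of this kind might look as follows; this is a sketch with synthetic data in place of real competition features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the preprocessed competition features
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 100-tree Random Forest with default settings as a quick benchmark
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Rank features by importance (most important first)
ranking = np.argsort(rf.feature_importances_)[::-1]
```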
- Train a simple Random Forest model and plot the confusion matrix for classification, or a true-vs-predicted scatter plot for regression. Find out where most of the prediction errors come from; for example, they may be concentrated in certain categories. This requires splitting the original training data into training and testing sets.
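The split-and-inspect step can be sketched like this, again with synthetic data standing in for a real training set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, random_state=1)

# Hold out part of the training data so errors can be studied
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, rf.predict(X_te))
# Off-diagonal cells show which classes are confused most often
```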
- General transformation: multiply, divide, sum, subtract, log, min, max, mean, std
- If the data contain distance or length variables, several new features can be generated by multiplying (area or volume), dividing (ratio between two lengths), subtracting (difference between two lengths), or summing (total length or distance) those variables or a subset of them.
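These transformations are one-liners in pandas; the column names below are hypothetical:

```python
import pandas as pd

# Hypothetical length/distance columns
df = pd.DataFrame({"length": [2.0, 3.0], "width": [4.0, 5.0]})

df["area"] = df["length"] * df["width"]    # multiply -> area
df["ratio"] = df["length"] / df["width"]   # divide -> ratio of two lengths
df["diff"] = df["length"] - df["width"]    # subtract -> difference
df["total"] = df["length"] + df["width"]   # sum -> total length
```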
- Date variables: (1) Extract day, month, quarter, year, weekend, weekday, holiday, etc. as new features (2) Calculate the length of time between two dates
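Both kinds of date features are available through pandas' `.dt` accessor; a sketch with made-up dates:

```python
import pandas as pd

df = pd.DataFrame({"start": pd.to_datetime(["2020-01-03", "2020-06-15"]),
                   "end":   pd.to_datetime(["2020-01-10", "2020-07-01"])})

# (1) calendar features extracted from a single date column
df["month"] = df["start"].dt.month
df["quarter"] = df["start"].dt.quarter
df["year"] = df["start"].dt.year
df["is_weekend"] = df["start"].dt.dayofweek >= 5   # Saturday=5, Sunday=6

# (2) elapsed time between two dates
df["days_between"] = (df["end"] - df["start"]).dt.days
```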
- Use the feature importance generated by Random Forest or XGBoost to rank features. Iteratively remove the least important features and refit the model until prediction accuracy starts to decrease.
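The iterative elimination loop might be sketched as below (a simplified version with synthetic data; a real run would track the last feature set that still scored well):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data; only a few features are actually informative
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)

kept = list(range(X.shape[1]))   # column indices still in play
best = 0.0
while len(kept) > 1:
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    score = cross_val_score(rf, X[:, kept], y, cv=3).mean()
    if score < best:             # accuracy started to decrease: stop
        break
    best = score
    rf.fit(X[:, kept], y)
    kept.pop(int(np.argmin(rf.feature_importances_)))  # drop least important
```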
- Use XGBoost for feature selection: (1) keep the number of trees small (<20); (2) keep the max depth of the trees small (<7); (3) iteratively run the feature importance analysis, removing the most (or least) important features.
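A small, shallow boosted model is cheap enough to refit many times during selection. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (XGBoost's sklearn wrapper exposes an analogous interface), with synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Small number of shallow trees, per the guidelines above (<20 trees, depth <7)
gbm = GradientBoostingClassifier(n_estimators=15, max_depth=3, random_state=0)
gbm.fit(X, y)
importance = gbm.feature_importances_  # basis for the next removal round
```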
- Decision-tree models (XGBoost, Random Forest) are not affected by multicollinearity.
- Popular models: Random Forest, Extra Trees, XGBoost
- Use grid search to fine tune parameters.
- For Random Forest and Extra Trees, two important parameters to tune are the number of trees and the number of randomly selected features considered at each split.
- XGBoost: (1) small eta -> small shrinkage -> less overfitting -> slower convergence -> needs more trees