Machine-Learning-empathy-prediction

Final project for CS 412 Machine Learning at University of Illinois at Chicago


Empathy Prediction

Supervised learning, dataset: Young people survey

Mirko Mantovani, 7 December 2018

CS 412 Machine Learning Final Project


Notebook

The notebook containing the full analysis workflow is workflow.ipynb. PCAtest.ipynb is an additional small notebook in which I experimented with PCA, which was not included in the final model.


Running the program

To run the program from the terminal, just use the command:

python train.py

The train.py script preprocesses the data, evaluates some basic models, and defines a baseline. It then splits the data into train/dev/test sets, tunes the hyperparameters of the two models (XGBoost and Random Forest) that make up the final ensemble, trains them, saves the models and the test set, and finally evaluates them on the test set, producing the final accuracy.
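
As a rough illustration, the main steps train.py goes through look like the sketch below (the file name, the 0-1 threshold on Empathy, and the parameter range are assumptions made for illustration; the actual code is in train.py and workflow.ipynb):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, validation_curve

# Load the Young people survey responses (file name is an assumption)
df = pd.read_csv('responses.csv')

# Drop rows where the target (Empathy) is missing, impute the rest with the column mode
df = df.dropna(subset=['Empathy'])
df = df.fillna(df.mode().iloc[0])

# Encode categorical columns and binarize the target (the threshold is an assumption)
X = pd.get_dummies(df.drop(columns=['Empathy']))
y = (df['Empathy'] >= 4).astype(int)

# 80/20 split; cross-validation on the training part is used for hyperparameter tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Example: pick n_estimators for the Random Forest with a validation curve
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42), X_train, y_train,
    param_name='n_estimators', param_range=list(range(10, 111, 10)),
    cv=5, scoring='accuracy', n_jobs=-1)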

python test.py

The test.py script only loads the final models, evaluates them on the saved test set, and prints the accuracies (it is just the final part of train.py, provided to make testing easier and faster, without having to go through training and hyperparameter tuning).
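
In essence, test.py does something like the following (the pickle file names here are assumptions; see test.py for the actual ones):

import pickle
import numpy as np

# Load the two trained models and the saved test data (file names are assumptions)
with open('xgb_model.pkl', 'rb') as f:
    xgb_model = pickle.load(f)
with open('rf_model.pkl', 'rb') as f:
    rf_model = pickle.load(f)
with open('x_test_xgb.pkl', 'rb') as f:
    x_test_xgb = pickle.load(f)   # test set restricted to the features selected for XGBoost
with open('x_test_rf.pkl', 'rb') as f:
    x_test_rf = pickle.load(f)    # test set with all features, for the Random Forest
with open('y_test.pkl', 'rb') as f:
    y_test = pickle.load(f)

# Predict with each model and report its accuracy
pred_xgb = xgb_model.predict(x_test_xgb)
pred_rf = rf_model.predict(x_test_rf)
print('Accuracy XGBoost:', np.mean(pred_xgb == y_test))
print('Accuracy Random Forest:', np.mean(pred_rf == y_test))

# Ensemble: predict "very empathetic" (1) only when both models agree (logical AND)
pred_ens = np.logical_and(pred_xgb == 1, pred_rf == 1).astype(int)
print('Final model (RF+XGB logical AND ensemble) accuracy :', np.mean(pred_ens == y_test))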


Modules and libraries

To run the train.py script you need the following Python modules:

  • pandas
  • numpy
  • sklearn
  • scipy
  • pickle
  • xgboost

While some of these modules are very popular and easy to install, xgboost can be tricky to install at times; therefore, the output generated by train.py on my machine is appended to this README, in case there are any problems running it.
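
On most setups the dependencies can be installed with pip (pickle is part of the Python standard library, and the sklearn module is provided by the scikit-learn package):

pip install pandas numpy scipy scikit-learn xgboost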

To run the test.py script you need the following Python modules:

  • pickle
  • xgboost
  • numpy

Output of train.py

Mirkos-MacBook-Pro:Machine-Learning-empathy-prediction mirkomantovani$ python train.py

---------- DATA PREPROCESSING ----------

Dropping rows with Empathy = NaN
New dataset size:(1005, 150)
Missing values imputation with mode
Categorical values -> binary, ordinal, OHE

---------- BASELINE AND SIMPLE CLASSIFIERS ----------

Logistic Regression multiclass (1,2,3,4,5) and then converting to 0-1
Accuracy on Test 0.6417910447761194
Accuracy on Train 0.8407960199004975
SVM OVO multiclass (1,2,3,4,5) and then converting to 0-1
Accuracy on Test 0.6019900497512438
Accuracy on Train 0.9664179104477612
Linear SVM binary classifier
Accuracy on Test 0.6417910447761194
BASELINE: majority voting classifier
Accuracy: 0.6616915422885572

---------- COMPLEX MODELS, ensembles ----------

Random forest regressor and conversion to 0-1
Accuracy on Test 0.7064676616915423
Accuracy on Train 0.9850746268656716
Random forest binary classifier 20-fold CV
Accuracy 0.702549019607843

---------- BUILDING FINAL MODEL ----------

Train/dev/set split: 80/20 and crossvalidation to tune Hyperparameters

---------- Parameters tuning and feature selection ----------

Parameters tuning Random Forest using crossvalidation and validationcurve
[Parallel(n_jobs=-1)]: Done 110 out of 110 | elapsed: 2.3s finished
Best n_estimators: 56, accuracy:0.7090169465531784
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 9.0s finished
Best max_depth: 7, accuracy:0.700328330206379
[Parallel(n_jobs=-1)]: Done 80 out of 80 | elapsed: 2.5s finished
Best max_features: 13, accuracy:0.7041344350679793

Feature selection, using SelectKBest, for XGboost (since Boosting does not work well with a lot of features, especially if they are correlated)

The most important 20 features that will be used for XGBoost: ['Judgment calls', 'Male', 'Life struggles', 'Psychology', 'Compassion to animals', 'Children', 'Latino', 'Fantasy/Fairy tales', 'Weight', 'Friends versus money', 'Cars', 'Theatre', 'Fake', 'PC', 'Reading', 'Spending on gadgets', 'Loss of interest', 'Borrowed stuff', 'Height', 'Foreign languages']
Parameters tuning XGBoost using RandomizedSearchCV
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=1)]: Done 50 out of 50 | elapsed: 3.1s finished
{'gamma': 1, 'learning_rate': 0.6671984172800605, 'max_depth': 16, 'reg_alpha': 13} -0.28695280732032963

---------- FINAL MODEL ----------

The final model is an ensemble of Random forest and extreme gradient boosting, the idea for this is explained in the write-up and even better in the notebook
Training XGBoost binary classifier (hinge) on all train (80%)
Saving xgb model
Training Random Forest binary classifier on all train (80%)
Saving xgb model
Testing on test set (20%)
Accuracy XGBoost: 0.7711442786069652
Accuracy Random Forest: 0.7412935323383084

Final model (RF+XGB logical AND ensemble) accuracy : 0.7910447761194029
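
The feature selection and XGBoost tuning steps shown in the output above roughly correspond to the sketch below (run here on random placeholder data; the parameter distributions are assumptions for illustration, the real ones are in train.py):

import numpy as np
from scipy.stats import randint, uniform
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Placeholder data standing in for the preprocessed training split
X_train = np.random.rand(800, 150)
y_train = np.random.randint(0, 2, size=800)

# Keep only the 20 highest-scoring features for XGBoost, since boosting
# tends to suffer with many (correlated) features
selector = SelectKBest(score_func=f_classif, k=20)
X_train_sel = selector.fit_transform(X_train, y_train)

# Randomized search over a hypothetical parameter space: 10 candidates, 5-fold CV
param_dist = {
    'gamma': randint(0, 5),
    'learning_rate': uniform(0.01, 1.0),
    'max_depth': randint(2, 20),
    'reg_alpha': randint(0, 20),
}
search = RandomizedSearchCV(
    XGBClassifier(objective='binary:hinge'),
    param_distributions=param_dist, n_iter=10, cv=5, verbose=1)
search.fit(X_train_sel, y_train)
print(search.best_params_, search.best_score_)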


Output of test.py

Mirkos-MacBook-Pro:Machine-Learning-empathy-prediction mirkomantovani$ python test.py
Loading XGBoost model
Loading XGBoost x_test
Loading Random Forest classifier model
Loading Random Forest x_test
Loading y_test
Predicting
Accuracy XGBoost: 0.7711442786069652
Accuracy Random Forest: 0.7412935323383084

Final model (RF+XGB logical AND ensemble) accuracy : 0.7910447761194029