The purpose of this project is to build an ML model able to infer the presence, or absence, of the Higgs boson starting from CERN measurements. The data were obtained from the Kaggle competition platform ( https://www.kaggle.com/c/epfml18-higgs ).
This README gives general information about the methods; more detailed documentation is provided within the functions themselves.
The provided code was tested with Python 3.6.5. The following libraries are used within the script:
Computational:
- numpy (as np)

Graphical:
- seaborn (as sns)
- matplotlib (as plt)
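In the scripts, these correspond to the following imports:

```python
import numpy as np               # numerical computations
import seaborn as sns            # statistical visualization
import matplotlib.pyplot as plt  # general plotting
```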
The folder structure has to be the following:
```
.
├── Data          # Data files, in .csv
│   ├── train.csv
│   └── test.csv
├── src           # Source files
└── README.md
```
All the scripts are in `src`; in `run.py` you can find the code that generates our predictions. This script produces a .csv file containing the predictions, `Kaggle_CDM_submission.csv`. The following steps are executed:
a) loading the data
b) processing the data:
- applying log transformations with translation
- imputing the missing values with the median
- normalizing the data
- splitting the variable `num_jet` into 4 categorical variables
c) polynomial expansion of the data up to degree 4
d) adding interactions between the categorical variables and the continuous features
e) training a ridge regression model, using `cross_validation` to determine the hyper-parameter lambda
f) training the ridge regression model on the whole training data set with the determined lambda to obtain the weight vector w
g) computing the predictions and creating the .csv file
The data preprocessing applies a log transformation to a specific set of features, after translating some of them. It then imputes the median for the missing values and removes the phi and eta features; a sketch of these transformations is given below.
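As an illustration, here is a minimal numpy sketch of steps b) through d) on synthetic data; the helper-free layout, offsets, and feature choices are ours, not the actual code in `src`:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(100, 3)                  # synthetic continuous features
x[np.random.rand(*x.shape) < 0.1] = np.nan   # synthetic missing values
num_jet = np.random.randint(0, 4, size=100)  # synthetic jet-count variable

# b) Log transformation with translation: shift each feature so it is
#    strictly positive before taking the log.
x = np.log(x - np.nanmin(x, axis=0) + 1.0)

# b) Impute missing values with the column median, then normalize.
x = np.where(np.isnan(x), np.nanmedian(x, axis=0), x)
x = (x - x.mean(axis=0)) / x.std(axis=0)

# b) Split num_jet into 4 categorical (one-hot) variables.
jets = (num_jet[:, None] == np.arange(4)).astype(float)

# c) Polynomial expansion of degree 4: [x, x^2, x^3, x^4] per feature.
x_poly = np.hstack([x ** d for d in range(1, 5)])

# d) Interactions between the categorical variables and the continuous
#    features: multiply every continuous column by every indicator.
inter = np.hstack([x * jets[:, [j]] for j in range(4)])

tx = np.hstack([x_poly, jets, inter])
```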
For the sake of automation, we added a `pred` keyword argument (kwarg) to all our model functions. It is `False` by default; if set to `True`, the function returns as its first output a pointer to the function to use in order to get the predictions for that model.
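For instance, with `ridge_regression` (a hypothetical call based on the interface described above; the outputs following the predictor are an assumption):

```python
# With pred=True the first output is the prediction function; the
# remaining outputs (w, loss) are assumed here for illustration.
predict, w, loss = ridge_regression(y, tx, lambda_, pred=True)
y_hat = predict(tx_test)  # predictions on unseen data
```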
All functions using the `gradient_descent` algorithm have, in addition, the two kwargs `printing` and `all_step`, which are `False` by default. If `printing=True`, the current MSE value and the values of the first two parameters of `w` are printed in the shell at every GD step. If `all_step=True`, the function returns all the computed weight vectors and errors (by default they are not stored and only the last value is returned).
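For example (a hypothetical call; the hyper-parameter values are placeholders):

```python
# Print the current MSE and w[0], w[1] at every GD step, and keep the
# full optimization trace instead of only the last value. The
# (ws, losses) unpacking is an assumption based on the description above.
ws, losses = least_squares_GD(y, tx, initial_w,
                              max_iters=500, gamma=0.05,
                              printing=True, all_step=True)
```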
The following functions were implemented:
Function | Arguments |
---|---|
`least_squares_GD` | `y, tx, initial_w[, max_iters, gamma, *args, **kwargs]` |
`least_squares_SGD` | `y, tx, initial_w[, batch_size, max_iters, gamma, *args, **kwargs]` |
`least_squares` | `y, tx[, **kwargs]` |
`ridge_regression` | `y, tx, lambda_[, **kwargs]` |
`logistic_regression` | `y, x[, w, max_iters, gamma, **kwargs]` |
`reg_logistic_regression` | `y, x, lambda_[, initial_w, max_iters, gamma, **kwargs]` |
The default values were chosen in order to obtain convergence of the GD algorithm.
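For reference, `ridge_regression` solves the standard penalized least-squares problem with the closed form w = (XᵀX + λ′I)⁻¹Xᵀy. A self-contained sketch (the scaling of `lambda_` by `2 * N` is a common convention and an assumption about the actual code):

```python
import numpy as np

def ridge_regression_sketch(y, tx, lambda_):
    """Closed-form ridge regression. The 2 * N scaling of lambda_ is a
    common convention and an assumption about the actual implementation."""
    n, d = tx.shape
    a = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    b = tx.T @ y
    return np.linalg.solve(a, b)

# Toy usage on synthetic data
np.random.seed(1)
tx = np.random.randn(50, 3)
y = tx @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(50)
w = ridge_regression_sketch(y, tx, lambda_=1e-3)
```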
Since our goal is to find a classification model, all functions compute the error vector `err` categorically: if `y_hat` is the vector of estimated categories and `y` the vector of true categories, the j-th coordinate of `err` is `err[j] = 1{y[j] ≠ y_hat[j]}`, where 1 is the indicator function. Furthermore, the loss value returned is the misclassification ratio (i.e. the number of wrong predictions over the total number of predictions).
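A minimal sketch of this categorical error and loss (the actual `category_error` implementation in the code may differ):

```python
import numpy as np

def category_error(y, y_hat):
    # err[j] = 1 if y[j] != y_hat[j], else 0 (indicator of a wrong prediction)
    return (y != y_hat).astype(float)

def misclassification_ratio(err):
    # number of wrong predictions over the total number of predictions
    return err.mean()

y = np.array([1, -1, 1, 1])
y_hat = np.array([1, 1, 1, -1])
print(misclassification_ratio(category_error(y, y_hat)))  # 0.5
```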
If one desires to use our functions for different tasks, it is enough to set the two global functions `err_f` and `loss_f` to the desired ones.
Possible `loss_f` | | Possible `err_f` |
---|---|---|
`calculate_mae` | MAE | `error` |
`calculate_mse` | MSE | `category_error` |
`calculate_rmse` | RMSE | |
They can be set as follows:

```python
err_f = error            # For continuous estimation.
loss_f = calculate_mse   # For mean squared error loss.
```
These are the two main functions implemented in order to choose our model, and in particular to get an estimate of the prediction error.

- `cross_validation(y, tx, k_fold, method, *args_method[, k_indices, seed])`
  computes the k-fold cross validation for the estimation of `y`, using the method function passed (as a pointer) in the argument `method`. The arguments required by `method` are to be passed freely after it. It returns `predictor, w, loss_tr, loss_te`, which are, in order: the predicting function, the mean of the trained weights, the mean of the training error, and the estimated test error.
- `multi_cross_validation(y, x, k_fold[, transformations=[[id, []]], methods=[[least_squares, []]], seed=1, only_best=True])`
  automatically performs the cross validation on all combinations of the transformations in the `transformations` list (their parameters have to be passed as a list coupled with each transformation) and of the methods, with varying parameters, in the `methods` list (here the coupled list has to be a list of tuples of parameter combinations to test). It then plots the estimated losses (both on train and test) and outputs `predictor, weight, losses_tr, losses_te, transformations_list, methods_list`. If `only_best=True`, those are the variables corresponding to the lowest test-error estimate; otherwise they contain the variables computed at each step. An implementation example can be found in the documentation.
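For example (a hypothetical call, assuming the signature above; `5` and `1e-3` are placeholder values for `k_fold` and `lambda_`):

```python
# Arguments after `method` (here lambda_) are forwarded to it.
predictor, w, loss_tr, loss_te = cross_validation(y, tx, 5,
                                                  ridge_regression, 1e-3)
y_hat = predictor(tx)  # predictions of the cross-validated model
```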
- William Cappelletti
- Charles Dufour
- Marie Sadler