ml-project-1

EPFL's Pattern Classification and Machine Learning first course project

Team members

  • Jade Copet
  • Merlin Nimier-David
  • Krishna Sapkota

The project was designed by Prof. Emtiyaz and the TAs; some ML methods and helper functions were provided by the teaching team.

Project structure

We provide the full project folder. The most relevant files are located in:

  • src: the machine learning methods used to train model parameters
  • results: the CSV outputs
  • report: the report PDF and LaTeX source
  • exploration: preliminary data-analysis scripts used to get familiar with the datasets
  • regression and classification: dataset-specific scripts used to minimize the expected test error (i.e. to train the best model)

Project TODO

Data pre-processing

  • Try ridge regression with several seeds and different polynomial degrees; plot the results and check stability (a feature-expansion sketch follows this list)
  • Select a few degrees and run cross-validation. (Selected degree 4 and produced boxplots)
  • Try removing more features and compare stability across methods with boxplots. (Didn't seem to help)
  • Try increasing the number of seeds
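
The degree-selection experiments above rely on a polynomial feature expansion. The sketch below shows how such a basis could be built in MATLAB; the function name buildPoly is illustrative, not one of the project's actual helpers.

    % Build a polynomial basis [1, X, X.^2, ..., X.^degree] from a raw
    % feature matrix X (N examples x D features). Illustrative helper,
    % not the project's actual code.
    function tX = buildPoly(X, degree)
        N = size(X, 1);
        tX = ones(N, 1);            % constant (bias) column
        for d = 1:degree
            tX = [tX, X.^d];        % element-wise powers of each feature
        end
    end

A degree-4 expansion, as selected above, is then tX = buildPoly(X, 4).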

ML methods

  • beta = leastSquaresGD(y,tX,alpha): Least squares using gradient descent (alpha is the step size)
  • beta = leastSquares(y,tX): Least squares using the normal equations
  • beta = ridgeRegression(y,tX,lambda): Ridge regression using the normal equations (lambda is the regularization coefficient); a normal-equations sketch appears after this list
  • beta = logisticRegression(y,tX,alpha): Logistic regression using gradient descent or Newton's method (alpha is the step size in the gradient-descent case)
  • beta = penLogisticRegression(y,tX,alpha,lambda): Penalized logistic regression using gradient descent or Newton's method (alpha is the step size for gradient descent, lambda is the regularization parameter)
  • Implement cross-validation for ridge regression
  • Implement cross-validation for penalized logistic regression
  • Implement generic cross-validation to estimate the train and test error of each method (a possible skeleton is sketched below)
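
For reference, the closed-form solutions behind the leastSquares and ridgeRegression signatures above are one-liners. This is a minimal sketch of the standard normal-equations form, assuming tX already contains a bias column; the actual project code may differ (e.g. in whether the bias term is penalized, which it is here for simplicity).

    function beta = leastSquares(y, tX)
        % Solve (tX' * tX) * beta = tX' * y with the backslash operator,
        % which is numerically preferable to explicitly inverting tX' * tX.
        beta = (tX' * tX) \ (tX' * y);
    end

    function beta = ridgeRegression(y, tX, lambda)
        % Ridge estimate; the lambda * I term regularizes the solution.
        % Simplifying assumption: the bias column is penalized too.
        M = size(tX, 2);
        beta = (tX' * tX + lambda * eye(M)) \ (tX' * y);
    end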

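A generic K-fold cross-validation routine could follow the skeleton below. This is a sketch under assumptions: kFoldCV, trainFn and errorFn are illustrative names (e.g. one of the methods above and an RMSE helper), not functions from the project.

    % Minimal K-fold cross-validation skeleton (illustrative).
    % trainFn(y, tX) returns beta; errorFn(y, tX, beta) returns an error value.
    function [trainErr, testErr] = kFoldCV(y, tX, K, trainFn, errorFn)
        N = numel(y);
        idx = randperm(N);                 % shuffle the example indices
        foldSize = floor(N / K);
        trainErr = zeros(K, 1);
        testErr  = zeros(K, 1);
        for k = 1:K
            testIdx  = idx((k-1)*foldSize + 1 : k*foldSize);
            trainIdx = setdiff(idx, testIdx);
            beta = trainFn(y(trainIdx), tX(trainIdx, :));
            trainErr(k) = errorFn(y(trainIdx), tX(trainIdx, :), beta);
            testErr(k)  = errorFn(y(testIdx),  tX(testIdx, :),  beta);
        end
    end

Cross-validating ridge regression then reduces to passing @(y,tX) ridgeRegression(y,tX,lambda) as trainFn.
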
Regression dataset

  • Learn a classifier to separate the two data models
  • Minimize the train and test error
  • Check the stability of the results
  • Output predictions to CSV
  • Update report with our results

Classification dataset

  • Verify / debug the automatic penalized logistic regression (a reference gradient-descent sketch follows this list)
  • Minimize the train and test error
  • Check the stability of the results
  • Output predictions to CSV
  • Update report with our results
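
When debugging the penalized logistic regression, a plain gradient-descent reference loop can be useful to compare against. This is a sketch, not the project's implementation: it assumes labels y in {0, 1}, a bias column in tX, and (for simplicity) penalizes all coefficients; maxIters is an extra parameter not in the project API.

    % Reference gradient-descent loop for penalized logistic regression
    % (illustrative; the project's version may use Newton's method instead).
    function beta = penLogisticRegressionGD(y, tX, alpha, lambda, maxIters)
        sigmoid = @(z) 1 ./ (1 + exp(-z));
        beta = zeros(size(tX, 2), 1);
        for it = 1:maxIters
            g = tX' * (sigmoid(tX * beta) - y) + lambda * beta;  % penalized gradient
            beta = beta - alpha * g;                             % descent step
        end
    end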

Predictions

  • predictions_regression.csv: Each row contains the prediction ŷn for one example in the test set
  • predictions_classification.csv: Each row contains the probability p(y = 1 | data) for one example in the test set
  • test_errors_regression.csv: Reports the RMSE for the methods "leastSquaresGD", "leastSquares" and "ridgeRegression"
  • test_errors_classification.csv: Reports the RMSE, 0-1 loss and log-loss for the methods "logisticRegression" and "penLogisticRegression" (a metrics-and-export sketch follows this list)
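
The metrics and files above could be produced as in the sketch below; the variable names yhat, p and y are illustrative, and csvwrite is MATLAB's classic CSV export function.

    % Illustrative computation of the reported metrics and CSV export.
    % y: true targets (regression) or {0,1} labels (classification);
    % yhat: real-valued predictions; p: predicted probabilities p(y=1|data).
    rmse    = sqrt(mean((y - yhat).^2));               % root mean squared error
    zeroOne = mean((p > 0.5) ~= y);                    % 0-1 loss at threshold 0.5
    logLoss = -mean(y .* log(p) + (1-y) .* log(1-p));  % negative log-likelihood

    csvwrite('predictions_regression.csv', yhat);      % one prediction per row
    csvwrite('predictions_classification.csv', p);     % one probability per row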

Report

  • Produce figures for the regression dataset
  • Report work done for the regression dataset and the corresponding results
  • Produce figures for the classification dataset
  • Double-check all figures for labels (on each axis and for the figure itself)
  • Clear conclusion and analysis of the results for each dataset
  • Include complete details about each algorithm (lambda values, number of folds, number of trials, etc.)
  • What worked and what did not? What do you think are the reasons behind that?
  • Why did you choose the method that you chose?