ml-project-2

The second course project for EPFL's Pattern Classification and Machine Learning course.

Team members

  • Jade Copet
  • Merlin Nimier-David

The project was designed by Prof. Emtiyaz and the course TAs.

Project structure

This project contained two tasks: detecting people in images, and building a song recommender system from listening-count data.

  • analysis: simple exploratory data analysis scripts we used to get to know the datasets better.
  • src:
    • detection: code for the people detection dataset. We experimented with Gaussian Processes, Neural Networks, PCA, SVM and Random Forests.
    • recommendation: code for the song recommendation dataset. We experimented with various feature extractions, ALS-WR, linear regression, K-means clustering, Gaussian Mixture Model clustering, Top-N recommendation and the Pearson similarity measure.
  • toolbox: place the dependencies there. Our code relies on the DeepLearn toolbox, Piotr toolbox, and the VBGM script (see Tools section).
  • report: project report (written in LaTeX). Contains references to related papers which were helpful for the project.
  • data and results: input and output data (provided as Matlab .mat files).
  • test: simple test scripts which were provided for us to check the output format of our predictions.
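The Pearson similarity measure mentioned under recommendation is the usual correlation coefficient, computed here over the items two users have both listened to. The following is a minimal sketch in Python (not the project's Matlab code); the dictionary layout and the function name are assumptions made for illustration.

```python
from math import sqrt

def pearson_similarity(counts_a, counts_b):
    """Pearson correlation between two users over the items both listened to.

    counts_a, counts_b: dicts mapping item id -> listening count
    (a hypothetical layout, not the project's .mat format).
    Returns 0.0 when fewer than two items are shared or a variance is zero.
    """
    shared = sorted(set(counts_a) & set(counts_b))
    if len(shared) < 2:
        return 0.0
    xs = [counts_a[k] for k in shared]
    ys = [counts_b[k] for k in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)
```

Returning 0.0 for users with too little overlap is one possible convention; a Top-N recommender could equally skip such pairs.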

Project's TODO

ML methods

  • Generic k-fold Cross Validation
  • Support Vector Machine
  • Gaussian process and several kernels
  • K-means clustering
  • Gaussian Mixture Model and EM algorithm
  • Principal Components Analysis (as a low-rank approximation) using alternating least squares
  • Neural Networks (implementation from the DeepLearn toolbox)
  • Generic ML method comparison function (for each method, plot achieved test error & stability with a boxplot)
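As an illustration of the generic k-fold cross-validation item above, here is a minimal Python sketch (not the project's Matlab code). The names `k_fold_indices`, `cross_validate`, `fit` and `error` are hypothetical; any method that can be trained and scored through two such callbacks would fit this shape.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is in exactly one test fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(fit, error, X, y, k=5, seed=0):
    """Return the list of per-fold test errors for a generic fit/error pair."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        errors.append(error(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return errors
```

The per-fold errors returned by `cross_validate` are the kind of values a boxplot comparison across methods could be drawn from.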

Dataset pre-processing

  • Basic data characteristics (dimensionality, distribution, correlation, ...)
  • Try obtaining helpful visualizations
  • Dimensionality reduction with PCA

Person detection dataset

  • Implement the relevant error measures
  • Try feature transformations (basis expansion)
  • Get a baseline error value
  • Implement PCA
  • Implement Logistic Regression
  • Experiment with the Neural Network's hyperparameters (number of layers, activation functions, dropout...)
  • Experiment with Gaussian Processes (check provided toolbox)
  • Experiment with SVM (check provided toolbox)
  • Experiment with Random Forest
  • Implement kCV fastROC
  • Generate feature selection plots (code to replicate for different methods)
  • Plot ROC Curves of different models for comparison
  • Maximize the train and test avgTPR
  • Check the stability of the results with k-CV

Song recommendation dataset

Recall we must achieve both weak (new ratings for existing users) and strong (entirely new users) prediction.

  • Implement the relevant error measures
  • Implement error diagnostics (which kind of counts do we make the most error on?)
  • Train / test split (specific to weak and strong prediction)
  • [X] Feature engineering: implement derived variables
  • [X] Get a baseline error value
  • Implement Top-N recommendation (cluster with Pearson similarity measure)
  • [X] Experiment with K-Means
  • Implement the simple Slope One method
  • Experiment with Gaussian Mixture Models (soft clustering)
  • Cluster the tail items (head / tail cutoff point to be chosen carefully)
  • Try clustering in reduced-dimensionality space
  • Implement a hybrid head / tail predictor (e.g. Each Item for head, Top-K for tail)
  • Determine if using the social graph helps weak prediction (then, we will be able to know if we can use it for strong prediction as well)
  • Use the social network and generic artist information for strong prediction
  • Generate feature selection plots
  • Minimize the train and test error
  • Check the stability of the results with random train / test splits
  • Use the artists' names to output fun facts
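The weak / strong distinction recalled at the top of this section calls for a train / test split with two separate held-out parts. One way to build it is sketched below in Python (not the project's Matlab code). The nested-dictionary layout, the function name and the default fractions are all assumptions made for illustration.

```python
import random

def weak_strong_split(counts, strong_fraction=0.1, weak_fraction=0.1, seed=0):
    """Split {user: {item: count}} into train, weak-test and strong-test parts.

    Strong test: whole users withheld from training (entirely new users).
    Weak test: a share of the (user, item) counts of the remaining users.
    """
    rng = random.Random(seed)
    users = sorted(counts)
    rng.shuffle(users)
    n_strong = int(len(users) * strong_fraction)
    strong_test = {u: dict(counts[u]) for u in users[:n_strong]}

    train, weak_test = {}, {}
    for u in users[n_strong:]:
        items = sorted(counts[u])
        rng.shuffle(items)
        # Keep at least one count per user in training so the user stays "existing".
        n_weak = max(0, min(int(len(items) * weak_fraction), len(items) - 1))
        weak_test[u] = {i: counts[u][i] for i in items[:n_weak]}
        train[u] = {i: counts[u][i] for i in items[n_weak:]}
    return train, weak_test, strong_test
```

Repeating the split with different `seed` values gives the random train / test splits needed to check the stability of the results.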

Predictions

  • songPred.mat contains the two matrices Ytest_weak_pred (size 1774x15082) and Ytest_strong_pred (size 93x15082)
  • personPred.mat contains a vector 'Ytest_score' (8743x1) with the prediction score for each test sample
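The expected shapes listed above lend themselves to a quick sanity check before submitting. The sketch below is in Python and works on plain lists of rows; it does not load .mat files, and the helper names are hypothetical. Only the variable names and sizes come from the list above.

```python
# Expected shapes as stated above, keyed by variable name.
EXPECTED_SHAPES = {
    "Ytest_weak_pred": (1774, 15082),
    "Ytest_strong_pred": (93, 15082),
    "Ytest_score": (8743, 1),
}

def shape_of(matrix):
    """Return (rows, cols) of a list-of-rows matrix, rejecting ragged rows."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("rows have inconsistent lengths")
    return n_rows, n_cols

def has_expected_shape(name, matrix):
    """True when the matrix matches the shape listed for this variable name."""
    return shape_of(matrix) == EXPECTED_SHAPES[name]
```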

Report

  • Describe and discuss the methods used, showing that we understand their inner workings and the influence of each hyperparameter (especially for methods we did not implement ourselves)
  • Produce figures for the detection dataset
  • Report work done for the detection dataset and the corresponding results
  • Produce figures for the recommendation dataset
  • Report work done for the recommendation dataset and the corresponding results
  • Double-check all figures for labels (on each axis and for the figure itself)
  • Clear conclusion and analysis of the results for each dataset
  • Include complete details about each algorithm (initialization values, lambda values, number of folds, number of trials, etc)
  • What worked and what did not? What do you think are the reasons behind that?
  • Why did you choose the method that you chose?

Tools