# ml-project-2

Second project for EPFL's Pattern Classification and Machine Learning course.
## Team members
- Jade Copet
- Merlin Nimier-David
The project was designed by Prof. Emtiyaz and the TAs.
## Project structure

This project consisted of two tasks: people detection in images, and a song recommender system built from listening counts data.
- `analysis`: simple exploratory data analysis scripts we used to get to know the datasets better.
- `src`:
  - `detection`: code for the people detection dataset. We experimented with Gaussian Processes, Neural Networks, PCA, SVM and Random Forests.
  - `recommendation`: code for the song recommendation dataset. We experimented with various feature extractions, ALS-WR, linear regression, K-means clustering, Gaussian Mixture Model clustering, Top-N recommendation and the Pearson similarity measure.
- `toolbox`: place the dependencies there. Our code relies on the DeepLearn toolbox, Piotr's toolbox, and the VBGM script (see the Tools section).
- `report`: project report (written in LaTeX). Contains references to related papers which were helpful for the project.
- `data` and `results`: input and output data (provided as Matlab `.mat` files).
- `test`: simple test scripts which were provided for us to check the output format of our predictions.
## Project's TODO
### ML methods

- Generic k-fold Cross Validation (see the sketch after this list)
- Support Vector Machine
- Gaussian Process with several kernels
- K-means clustering
- Gaussian Mixture Model and the EM algorithm
- Principal Component Analysis (as a low-rank approximation) using alternating least squares
- Neural Networks (implementation from the DeepLearn toolbox)
- Generic ML method comparison function (for each method, plot the achieved test error and its stability with a boxplot)
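A minimal sketch of what the generic k-fold cross validation can look like, assuming the model is wrapped in `trainFn` / `predictFn` function handles and `errorFn` is one of the error measures (all three names are hypothetical placeholders):

```matlab
% Generic k-fold cross validation (sketch). trainFn, predictFn and
% errorFn are hypothetical function handles; any leftover samples when
% N is not divisible by k are simply ignored.
function [trErr, teErr] = kFoldCV(X, y, k, trainFn, predictFn, errorFn)
    N = size(X, 1);
    idx = randperm(N);                  % shuffle once, then cut into k folds
    foldSize = floor(N / k);
    trErr = zeros(k, 1);
    teErr = zeros(k, 1);
    for i = 1:k
        teIdx = idx((i - 1) * foldSize + 1 : i * foldSize);
        trIdx = setdiff(idx, teIdx);
        model = trainFn(X(trIdx, :), y(trIdx));
        trErr(i) = errorFn(y(trIdx), predictFn(model, X(trIdx, :)));
        teErr(i) = errorFn(y(teIdx), predictFn(model, X(teIdx, :)));
    end
end
```

Keeping all k error values (rather than only their mean) is what feeds the stability boxplots in the method comparison function.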
### Dataset pre-processing

- Basic data characteristics (dimensionality, distribution, correlation, ...)
- Try obtaining helpful visualizations
- Dimensionality reduction with PCA (see the sketch after this list)
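For reference, the textbook PCA reduction via an SVD of the centered data; note that our code may instead use the ALS low-rank approximation mentioned above, and the number of components `m` below is an arbitrary placeholder:

```matlab
% PCA by SVD of the centered data matrix (textbook variant; the number
% of components m = 50 is an arbitrary placeholder).
mu = mean(X, 1);
Xc = bsxfun(@minus, X, mu);         % center each feature
[~, ~, V] = svd(Xc, 'econ');        % columns of V = principal directions
m = 50;
Z = Xc * V(:, 1:m);                 % reduced-dimensionality representation
Xrec = Z * V(:, 1:m)';              % rank-m reconstruction (still centered)
```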
### Person detection dataset
- Implement the relevant error measures
- Try feature transformations (basis expansion)
- Get a baseline error value
- Implement PCA
- Implement Logistic Regression
- Experiment with the Neural Network's hyperparameters (number of layers, activation functions, dropout, ...)
- Experiment with Gaussian Processes (check the provided toolbox)
- Experiment with SVM (check the provided toolbox)
- Experiment with Random Forests
- Implement kCV fastROC (see the ROC sketch after this list)
- Generate feature selection plots (code to replicate for different methods)
- Plot the ROC curves of different models for comparison
- Maximize the train and test avgTPR
- Check the stability of the results with k-CV
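The exact `fastROC` / avgTPR measure was provided with the project; as a stand-in, here is one plausible way to compute the ROC points and the area under the curve from raw prediction scores:

```matlab
% ROC curve from raw prediction scores (sketch; the course-provided
% fastROC defines avgTPR precisely, the AUC below is only a stand-in).
function [fpr, tpr, auc] = rocPoints(yTrue, scores)
    [~, order] = sort(scores, 'descend');   % sweep the threshold downwards
    isPos = yTrue(order) > 0;
    isPos = isPos(:);                       % force a column vector
    P = sum(isPos);
    N = numel(isPos) - P;                   % assumes both classes present
    tpr = [0; cumsum(isPos)  / P];          % true positive rate per threshold
    fpr = [0; cumsum(~isPos) / N];          % false positive rate per threshold
    auc = trapz(fpr, tpr);                  % area under the ROC curve
end
```

Plotting `fpr` against `tpr` for each model on the same axes gives the comparison plot from the list above.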
### Song recommendation dataset

Recall that we must achieve both weak prediction (new ratings for existing users) and strong prediction (entirely new users).
- Implement the relevant error measures
- Implement error diagnostics (which kinds of counts do we make the most errors on?)
- Train / test split (handled separately for weak and strong prediction)
- [X] Feature engineering: implement derived variables
- [X] Get a baseline error value
- Implement Top-N recommendation (cluster with the Pearson similarity measure; see the sketch after this list)
- [X] Experiment with K-Means
- Implement the simple Slope One method
- Experiment with Gaussian Mixture Models (soft clustering)
- Cluster the tail items (head / tail cutoff point to be chosen carefully)
- Try clustering in reduced-dimensionality space
- Implement a hybrid head / tail predictor (e.g. Each Item for head, Top-K for tail)
- Determine whether using the social graph helps weak prediction (this will tell us whether we can use it for strong prediction as well)
- Use the social network and generic artist information for strong prediction
- Generate feature selection plots
- Minimize the train and test error
- Check the stability of the results with random train / test splits
- Use the artist names to output fun facts
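For the Top-N recommendation item, a sketch of the Pearson similarity between two users' listening-count vectors, computed only over co-listened artists (the function and variable names are our own):

```matlab
% Pearson similarity between two users' listening-count vectors,
% restricted to the artists both users have listened to (sketch).
function s = pearsonSim(u, v)
    common = (u > 0) & (v > 0);     % co-listened artists only
    if nnz(common) < 2
        s = 0;                      % too little overlap to correlate
        return;
    end
    a = u(common) - mean(u(common));
    b = v(common) - mean(v(common));
    denom = sqrt(sum(a .^ 2) * sum(b .^ 2));
    if denom == 0
        s = 0;                      % constant counts carry no signal
    else
        s = sum(a .* b) / denom;
    end
end
```

A Top-N prediction then scores an unseen (user, artist) pair as the similarity-weighted average of the counts of the N most similar users.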
## Predictions

- `songPred.mat` contains the two matrices `Ytest_weak_pred` (size 1774x15082) and `Ytest_strong_pred` (size 93x15082)
- `personPred.mat` contains a vector `Ytest_score` (size 8743x1) with the prediction score for each test sample

Both files can be sanity-checked as sketched below.
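A minimal sketch for writing the two files and checking the expected sizes (it assumes the prediction matrices already exist in the workspace):

```matlab
% Write the prediction files and check the expected sizes (sketch).
save('songPred.mat', 'Ytest_weak_pred', 'Ytest_strong_pred');
save('personPred.mat', 'Ytest_score');

assert(isequal(size(Ytest_weak_pred),   [1774 15082]));
assert(isequal(size(Ytest_strong_pred), [93 15082]));
assert(isequal(size(Ytest_score),       [8743 1]));
```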
## Report

- Describe and discuss the methods used and show that we understand their inner workings and the influence of each hyperparameter (especially for methods we did not implement ourselves)
- Produce figures for the detection dataset
- Report work done for the detection dataset and the corresponding results
- Produce figures for the recommendation dataset
- Report work done for the recommendation dataset and the corresponding results
- Double-check all figures for labels (on each axis and for the figure itself)
- Clear conclusion and analysis of the results for each dataset
- Include complete details about each algorithm (initialization values, lambda values, number of folds, number of trials, etc.)
- What worked and what did not? What do you think are the reasons behind that?
- Why did you choose the method that you chose?