Performance Analysis of Neural Collaborative Filtering (NCF) Architecture

Infrastructure for Advanced Analytics and Machine Learning, Summer Semester 2020

Authors:

  • Bagrat Ter-Akopyan
  • Michał Filipiuk

A full description of the project can be found here:

The code used here was adapted from an NVIDIA repository.

Quick Start Guide

  1. Clone the repository.
git clone https://github.com/mkfilipiuk/AAML_Bagrat_Ter-Akopyan_Michal_Filipiuk
cd AAML_Bagrat_Ter-Akopyan_Michal_Filipiuk
  1. On a Linux machine, you can create an Anaconda virtual environment from environment.yml:
conda env create -f environment.yml

Otherwise:

conda create -n iaaml_ncf python=3.6

and install the missing dependencies listed in environment.yml manually.

  1. Activate the Anaconda virtual environment:
conda activate iaaml_ncf
  1. Download and preprocess the data.

Preprocessing consists of downloading the data, filtering out users that have fewer than 20 ratings (by default), sorting the ratings, and dropping duplicates. The preprocessed train and test data are then saved in PyTorch binary format so they can be loaded just before training.

No data augmentation techniques are used.
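The filtering and deduplication described above can be sketched in pandas. This is an illustrative example using the ML-20m CSV column names (userId, movieId, timestamp), not the repository's actual preprocessing code:

```python
import pandas as pd

MIN_RATINGS = 20  # default threshold for filtering users


def preprocess(ratings: pd.DataFrame, min_ratings: int = MIN_RATINGS) -> pd.DataFrame:
    """Illustrative sketch of the preprocessing steps described above."""
    # Keep only users with at least `min_ratings` ratings.
    counts = ratings.groupby("userId")["movieId"].transform("size")
    filtered = ratings[counts >= min_ratings]
    # Sort chronologically within each user, then drop duplicate (user, item) pairs.
    filtered = filtered.sort_values(["userId", "timestamp"])
    return filtered.drop_duplicates(subset=["userId", "movieId"])
```

A train/test split (e.g. holding out each user's most recent rating) would follow the same pattern before saving the tensors.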

Download the data from https://grouplens.org/datasets/movielens/20m/ and put it into the ./data directory.

To preprocess the ML-20m dataset, run:

./prepare_dataset.sh

Note: This command will return immediately without downloading anything if the data is already present in the ./data directory.

This will store the preprocessed training and evaluation data in the ./data directory so that it can later be used to train the model (by passing the appropriate --data argument to the ncf.py script).

  1. Start the MLflow tracking server:
mlflow server
  1. Start training.
python -m torch.distributed.launch --nproc_per_node=8 --use_env ncf.py --data ./data/cache/ml-20m --checkpoint_dir ./data/checkpoints/

This will result in a checkpoint file being written to ./data/checkpoints/model.pth.
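Once training finishes, the checkpoint can be loaded back for evaluation or inference. A minimal sketch, assuming the file is a standard torch.save state dict; a tiny stand-in model replaces the real NCF model here so the snippet is self-contained:

```python
import torch

# Stand-in for the trained model; in practice this would be the NCF model
# and the path would be ./data/checkpoints/model.pth.
model = torch.nn.Linear(4, 1)
torch.save(model.state_dict(), "model_demo.pth")

# map_location="cpu" lets a GPU-trained checkpoint load on any machine.
state = torch.load("model_demo.pth", map_location="cpu")
model.load_state_dict(state)
model.eval()  # disable training-only behavior (e.g. dropout) before evaluation
```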

  1. To run all the scripts and reproduce the complete results, start Jupyter:
jupyter notebook

Then open training.ipynb and run the cells.

  1. To reproduce the plots:
  • open create_plots.ipynb
  • run the cells