Performance Analysis of Neural Collaborative Filtering (NCF) Architecture

Infrastructure for Advanced Analytics and Machine Learning, Summer Semester 2020

Authors:

  • Bagrat Ter-Akopyan
  • Michał Filipiuk

A full description of the project can be found here:

The code used here was adapted from an NVIDIA repository.

Quick Start Guide

  1. Clone the repository.
git clone https://github.com/mkfilipiuk/AAML_Bagrat_Ter-Akopyan_Michal_Filipiuk
cd AAML_Bagrat_Ter-Akopyan_Michal_Filipiuk
  1. On a Linux machine, you can create an Anaconda virtual environment from environment.yml:
conda env create -f environment.yml

Otherwise:

conda create -n iaaml_ncf python=3.6

and install the missing dependencies listed in environment.yml manually.

  1. Activate the Anaconda virtual environment:
conda activate iaaml_ncf
  1. Download and preprocess the data.

Preprocessing consists of downloading the data, filtering out users that have fewer than 20 ratings (by default), sorting the ratings, and dropping duplicates. The preprocessed train and test data are then saved in PyTorch binary format so they can be loaded just before training.

No data augmentation techniques are used.
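The filtering and deduplication described above can be sketched in pandas. This is an illustrative example using the ML-20m CSV column names (userId, movieId, timestamp), not the repository's actual preprocessing code:

```python
import pandas as pd

MIN_RATINGS = 20  # default threshold for filtering users


def preprocess(ratings: pd.DataFrame, min_ratings: int = MIN_RATINGS) -> pd.DataFrame:
    """Illustrative sketch of the preprocessing steps described above."""
    # Keep only users with at least `min_ratings` ratings.
    counts = ratings.groupby("userId")["movieId"].transform("size")
    filtered = ratings[counts >= min_ratings]
    # Sort chronologically within each user, then drop duplicate (user, item) pairs.
    filtered = filtered.sort_values(["userId", "timestamp"])
    return filtered.drop_duplicates(subset=["userId", "movieId"])
```

A train/test split (e.g. holding out each user's most recent rating) would follow the same pattern before saving the tensors.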

Download the data from https://grouplens.org/datasets/movielens/20m/ and put it into the ./data directory.

To preprocess the ML-20m dataset, run:

./prepare_dataset.sh

Note: This command will return immediately without downloading anything if the data is already present in the ./data directory.

This will store the preprocessed training and evaluation data in the ./data directory so that it can later be used to train the model (by passing the appropriate --data argument to the ncf.py script).

  1. Start the MLflow tracking server:
mlflow server
  1. Start training.
python -m torch.distributed.launch --nproc_per_node=8 --use_env ncf.py --data ./data/cache/ml-20m --checkpoint_dir ./data/checkpoints/

This will result in a checkpoint file being written to ./data/checkpoints/model.pth.
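Once training finishes, the checkpoint can be loaded back for evaluation or inference. A minimal sketch, assuming the file is a standard torch.save state dict; a tiny stand-in model replaces the real NCF model here so the snippet is self-contained:

```python
import torch

# Stand-in for the trained model; in practice this would be the NCF model
# and the path would be ./data/checkpoints/model.pth.
model = torch.nn.Linear(4, 1)
torch.save(model.state_dict(), "model_demo.pth")

# map_location="cpu" lets a GPU-trained checkpoint load on any machine.
state = torch.load("model_demo.pth", map_location="cpu")
model.load_state_dict(state)
model.eval()  # disable training-only behavior (e.g. dropout) before evaluation
```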

  1. To run all the scripts and reproduce the complete results, start Jupyter:
jupyter notebook

Then open training.ipynb and run the cells.

  1. To reproduce the plots:
  • open create_plots.ipynb
  • run the cells