This repository contains an implementation of Collaborative Deep Learning for Recommender Systems (Wang, Wang, and Yeung). In contrast to the experiment presented in the paper, here we explore using BERT to embed the content rather than a bag-of-words representation.
The default hyperparameters (the best ones we found) achieve 25.3% recall@300 with the bag-of-words embedding. The exact same hyperparameters achieve 31.5% recall@300 simply by switching to the BERT embedding.
The project was developed with Python 3.8. After setting up your Python 3.8 environment, you can install the requirements with pip:

```
pip install -r requirements.txt
```
The repository contains only raw data. To prepare the data in the format the source code expects, run

```
make citeulike-a
```

which reads from `data/raw/` and writes to `data/processed/`. Note that `make citeulike-a` may take a long time to run, as it downloads a pre-trained BERT model and then embeds every document with that model. A GPU speeds up the embedding step significantly.
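To give a sense of what the embedding step does, here is a minimal sketch of mean-pooled BERT document embeddings. It assumes the HuggingFace `transformers` library and the `bert-base-uncased` checkpoint; the repository's actual embedding code (and the checkpoint the Makefile downloads) may differ.

```python
# Illustrative sketch only: not the repository's embedding code.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(texts):
    """Return one fixed-size vector per document (mean-pooled final hidden states)."""
    with torch.no_grad():
        inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        outputs = model(**inputs)
        mask = inputs["attention_mask"].unsqueeze(-1)            # (batch, seq, 1)
        summed = (outputs.last_hidden_state * mask).sum(dim=1)   # ignore padding tokens
        return summed / mask.sum(dim=1)

vectors = embed(["A paper abstract about recommender systems.",
                 "Another citeulike-a document."])
print(vectors.shape)  # torch.Size([2, 768]) for bert-base
```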
The script `train.py` trains the CDL model from the data files created in the previous step. To train the model and compute recall@300, run

```
python train.py -v
```

The `-v` flag enables "verbose" mode. There are many more command-line flags to customize behavior; almost every hyperparameter can be set without changing any code. By default, the BERT embedding is used; to use the bag-of-words embedding, run `train.py` with the `--embedding bow` flag.
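For example, combining the flags described above, a bag-of-words training run in verbose mode would look like:

```
python train.py -v --embedding bow
```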
The training script prints the recall@300 when it finishes. It also saves the model to disk (by default to `model.pt`, configurable with the `--out` flag). The `infer.py` program can be run to (re)compute the recall.
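If you want to inspect the saved model outside of `infer.py`, a minimal sketch follows. It assumes `model.pt` was written with `torch.save`, which is a guess based on the `.pt` extension; check `train.py` for the actual serialization format.

```python
# Illustrative sketch only: assumes train.py saved the model with torch.save.
import torch

checkpoint = torch.load("model.pt", map_location="cpu")
# Depending on how it was saved, this is either a full nn.Module or a state_dict.
print(type(checkpoint))
```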