This repository includes a sample implementation of a Perceiver-IO recommender for a news recommendation task on the MIND dataset, along with NAML and NRMS baseline implementations.
This code can be run via Docker on a Linux environment. We confirmed that it runs under the environment below.
- Linux (Ubuntu 20.04 LTS)
- Docker: version 20.10.6, build 370c289
- docker-compose: 1.29.1, build c34c88b2
- nvidia-container-toolkit: 1.5.1-1 amd64
- GPU: NVIDIA GeForce RTX 2080 Ti
```
git clone --recursive https://github.com/stockmarkteam/perceiver_io_recommender.git
```
Environment parameters need to be written in a `.env` file. You can simply copy it from `.env.sample` to work with the default parameters.
```
cd perceiver_io_recommender
cp .env.sample .env
```
These are the necessary parameters written in the `.env` file; you can edit them if necessary.
- `COMPOSE_PROJECT_NAME`: Needed for docker-compose. For details, please refer to the docker-compose documentation.
- `DEVICE` (default: `gpu`): Device setting for Docker. Can be set to `gpu` or `cpu`, but we tested the code only with the `gpu` setting.
- `DATASET_PATH` (default: `$(PWD)/dataset`): Host directory for the MIND dataset. It is mounted to the `dataset/` directory in the container.
- `MODEL_PATH` (default: `$(PWD)/models`): Host directory for the pretrained GloVe and Transformer models. It is mounted to the `models/` directory in the container.
- `LOG_PATH` (default: `$(PWD)/logs`): Host directory for training logs. It is mounted to the `logs/` directory in the container.
- `VENV_PATH`: Host directory for the Python virtual environment. It is mounted to the `.venv/` directory in the container.
- `JUPYTER_PORT` (default: `8888`): The port number bound on the host OS for access to the Jupyter notebook launched in the container.
- `TENSORBOARD_PORT` (default: `6006`): The port number bound on the host OS for access to the TensorBoard instance launched in the container.
```
make setup
```
Download the MIND dataset and put the zip files in a directory that is visible from the container. You can put them in the same folder as this README.
```
make sh
```
Run the command below in the container to do all the necessary preprocessing.
```
pipenv run preprocess-all data_path.train_zip=<path/to/MINDxxx_train.zip> data_path.valid_zip=<path/to/MINDxxx_dev.zip>
```
If you are working with the large dataset, please add this parameter to the above command: `params.dataset_type=large`
Run the command below in the container to train the Perceiver-IO model.
```
pipenv run train
```
Some of the optional parameters are listed below.
- `model`: `naml` or `nrms` (default: `nrms`)
- `embedding_layer`: `word_embedding` or `transformer` (default: `word_embedding`)
- `hparams.article_attributes`: Can be selected from `[title,body,category,subcategory]` (default: `[title,body,category,subcategory]`)
- `hparams.n_epochs`: Number of training epochs (default: `3`)
- `hparams.max_title_length`: Maximum number of tokens taken from article titles (default: `30`)
- `hparams.max_body_length`: Maximum number of tokens taken from article bodies (default: `128`)
- `hparams.batch_size.train`, `hparams.batch_size.valid`: Batch sizes for training and validation (the defaults differ depending on the selected embedding layer)
- `hparams.accumulate_grad_batches`: The effective training batch size becomes `hparams.batch_size.train` * `hparams.accumulate_grad_batches`
- `dataset`: If set to `precomputed`, article text data is read from serialized files, so fetching data during training can be sped up
- `num_workers`: Number of workers for the DataLoader (default: `4`)
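For example, the options above can be combined on the command line like this (an illustrative combination, not a configuration evaluated in the paper):

```
pipenv run train model=naml embedding_layer=transformer hparams.article_attributes=[title,category] hparams.n_epochs=5
```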
This library uses Hydra as its config manager, and everything in the config can be overridden from the command line.
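Conceptually, each `key.subkey=value` argument walks the nested config and replaces one leaf. The sketch below illustrates that idea with a plain dict; it is not Hydra's actual implementation (Hydra additionally handles type parsing, validation, and interpolation):

```python
def apply_override(cfg: dict, override: str) -> None:
    """Apply a single 'a.b.c=value' style override to a nested dict, in place."""
    key_path, _, raw_value = override.partition("=")
    keys = key_path.split(".")
    node = cfg
    for key in keys[:-1]:
        # Descend into (or create) each intermediate level of the config.
        node = node.setdefault(key, {})
    node[keys[-1]] = raw_value  # values stay strings in this simplified sketch

config = {"hparams": {"n_epochs": "3", "batch_size": {"train": "32"}}}
for ov in ["hparams.n_epochs=5", "hparams.batch_size.train=16"]:
    apply_override(config, ov)
print(config["hparams"]["n_epochs"])             # prints 5
print(config["hparams"]["batch_size"]["train"])  # prints 16
```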
In the paper, we compared results with DIEN, so we convert data from their repository in order to make an apples-to-apples comparison.
First, download `meta_Books.json` from http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/ and put it under `dataset/amazon/books`.
Then, download `local_train_splitByUser` and `local_test_splitByUser` from the DIEN repository and put them under `dataset/amazon/books`.
Finally, run the command below:
```
sh src/preprocess/scripts/amazon/preprocess_amazon.sh
```
Training can be done in the same way as for news recommendation on the MIND dataset:
```
pipenv run python3 -m src.train.main dataset_name=amazon dataset_type=books hparams.n_negatives=1 model=perceiver_io hparams.word_pos_emb=True hparams.feat_type_emb=True dataset=precomputed hparams.article_attributes=[title,body,category]
```
If you have already done the preprocessing for news recommendation on the MIND dataset, no extra preprocessing is needed. To run training, run the command below:
```
pipenv run python3 -m src.train.main model=perceiver_io_category_prediction hparams_model=perceiver_io embedding_layer=word_embedding hparams.article_attributes=[title,body] hparams.classify_attr=category dataset=precomputed
```
```
pipenv run tensorboard
```
You can browse the results at `localhost:${TENSORBOARD_PORT}` on the host.