
Recommendation System

Final Project for the UPC Artificial Intelligence with Deep Learning Postgraduate Course, Spring 2021.

Introduction

The goal of this project is to develop a state-of-the-art collaborative filtering recommender system using deep learning that can be trained on multiple datasets. Collaborative filtering is the method used by tech companies like Amazon and Netflix for their recommender systems, as it can be trained without user & item metadata.

Factorization machines were proposed in this 2010 paper as a way of doing matrix factorization that improves on classic singular value decomposition models by being better optimized for sparse datasets and by supporting additional interaction context; these characteristics make the model ideal for recommender systems. In addition, we also want to evaluate this paper, which proposes replacing the factorization machine embeddings with a graph convolutional network.

Overview

Setup & Usage

The whole project is implemented as a command-line tool. We provide the following commands:

  • nnrecommend train: train a model on a dataset
  • nnrecommend fit: fit a dataset using surprise algorithms
  • nnrecommend tune: tune model hyperparameters using ray tune
  • nnrecommend explore-dataset: show information about a dataset
  • nnrecommend recommend: load a trained model to get recommendations

Please find the tool setup and usage instructions in this separate file.

Architecture

Model

Our model classes can be found in the nnrecommend.model namespace. The FactorizationMachine equation is split into two classes: LinearFeatures implements the linear regression part of the equation, while FactorizationMachineOperation implements the square-of-sum minus sum-of-squares part. The embeddings can be of three types: the plain torch.nn.Embedding for linear embeddings, GraphEmbedding, which internally uses torch_geometric.nn.GCNConv, and GraphAttentionEmbedding, which internally uses torch_geometric.nn.GATConv. On top of the embedding module we added dropout to prevent the model from overfitting.

factorization machine model
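
To illustrate how these pieces fit together, here is a minimal PyTorch sketch of the model with a plain linear embedding (class names follow the description above, but arguments and details are simplified and may differ from the actual implementation):

import torch

class LinearFeatures(torch.nn.Module):
    # linear regression part: one learned weight per id plus a global bias
    def __init__(self, field_dim: int):
        super().__init__()
        self.weight = torch.nn.Embedding(field_dim, 1)
        self.bias = torch.nn.Parameter(torch.zeros(1))

    def forward(self, ids: torch.Tensor) -> torch.Tensor:  # ids: (batch, fields)
        return self.weight(ids).sum(dim=1) + self.bias

class FactorizationMachineOperation(torch.nn.Module):
    # square of sum minus sum of squares over the field axis
    def forward(self, emb: torch.Tensor) -> torch.Tensor:  # emb: (batch, fields, dim)
        square_of_sum = emb.sum(dim=1) ** 2
        sum_of_squares = (emb ** 2).sum(dim=1)
        return 0.5 * (square_of_sum - sum_of_squares).sum(dim=1, keepdim=True)

class FactorizationMachine(torch.nn.Module):
    def __init__(self, field_dim: int, embed_dim: int, dropout: float = 0.5):
        super().__init__()
        self.linear = LinearFeatures(field_dim)
        self.embedding = torch.nn.Embedding(field_dim, embed_dim)
        self.dropout = torch.nn.Dropout(dropout)  # prevents overfitting
        self.fm = FactorizationMachineOperation()

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        emb = self.dropout(self.embedding(ids))
        return (self.linear(ids) + self.fm(emb)).squeeze(1)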

We implemented the Bayesian Personalized Ranking loss as explained in this paper in the BPRLoss class.
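
The core of the loss fits in a couple of lines; a minimal sketch, assuming the model produces one prediction for a positive and one for a negative interaction of the same user:

import torch

class BPRLoss(torch.nn.Module):
    # maximize the margin between the positive and the negative prediction
    def forward(self, positive: torch.Tensor, negative: torch.Tensor) -> torch.Tensor:
        return -torch.nn.functional.logsigmoid(positive - negative).mean()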

Since we wanted to evaluate this new model against existing recommender system models, we implemented adapters for the different algorithms supported by the scikit-surprise library. These can be found in the nnrecommend.algo namespace.

Dataset

All the classes related to reading the datasets for the different types of data can be found in the nnrecommend.dataset namespace.

InteractionDataset is the main dataset class used in the tool. It takes care of dealing with a two-dimensional numpy array of interactions where every column is formed by consecutive ids. To guarantee its correct behavior, we wrote some unit tests for it that can be found in the test_dataset.py file.

interaction dataset
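
For example, a toy interaction array with three users and three items could look like this (user ids 0-2 are followed by item ids 3-5, so each column contains its own range of consecutive ids):

import numpy as np

# columns: user id, item id, label
interactions = np.array([
    [0, 3, 1],
    [0, 4, 1],
    [1, 4, 1],
    [2, 5, 1],
])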

The InteractionPairDataset class allows the system to train the model with BPRLoss by converting the dataset into pairs of positive and negative interactions.

interaction pair dataset
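
Conceptually the class behaves like this minimal sketch (names simplified, not the actual implementation), zipping every positive row with one of its sampled negatives:

import torch

class InteractionPairDataset(torch.utils.data.Dataset):
    # yields (positive, negative) pairs so BPRLoss can compare their predictions
    def __init__(self, positives: torch.Tensor, negatives: torch.Tensor):
        assert len(positives) == len(negatives)
        self.positives = positives
        self.negatives = negatives

    def __len__(self) -> int:
        return len(self.positives)

    def __getitem__(self, idx: int):
        return self.positives[idx], self.negatives[idx]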

The GroupingDataset class is used on the testset so that the dataloader (with batch size 1) returns the groups of interactions belonging to the same user.
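
A minimal sketch of the idea (again simplified), grouping a numpy interaction array by its user column:

import numpy as np
import torch

class GroupingDataset(torch.utils.data.Dataset):
    # groups rows by the user column so a DataLoader with batch_size=1
    # yields all the interactions of one user at a time
    def __init__(self, interactions: np.ndarray):
        users = interactions[:, 0]
        self.groups = [torch.from_numpy(interactions[users == u])
                       for u in np.unique(users)]

    def __len__(self) -> int:
        return len(self.groups)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return self.groups[idx]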

Since we want to be able to load different dataset sources, each one needs to implement BaseDatasetSource. This class sets up the basic dataset logic like removing low interactions, adding previous item context, extracting the testset interactions, etc. It is currently implemented for the movielens, spotify & podcasts datasets. Every new dataset source needs to load:

  • self.trainset: the InteractionDataset with the data
  • self.items: a pandas DataFrame containing item metadata (used in the recommender)

Operations

Since most of the commands have similar flows, we implemented them in reusable classes in the nnrecommend.operation namespace.

The Setup class takes the nnrecommend.dataset.BaseDatasetSource and configures the negative sampling based on the provided hyperparameters.
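
The negative sampling itself could look roughly like this sketch (the function and its parameters are illustrative, not the actual API): for every observed interaction we draw item ids that the user never interacted with and label them 0.

import numpy as np

def sample_negatives(interactions: np.ndarray, item_min: int, item_max: int,
                     num_negatives: int, seed: int = 42) -> np.ndarray:
    # for every positive (user, item) row, sample num_negatives item ids
    # that the user never interacted with, labelled with 0
    rng = np.random.default_rng(seed)
    seen = {(int(u), int(i)) for u, i in interactions[:, :2]}
    rows = []
    for u, _ in interactions[:, :2]:
        for _ in range(num_negatives):
            j = int(rng.integers(item_min, item_max + 1))
            while (int(u), j) in seen:
                j = int(rng.integers(item_min, item_max + 1))
            rows.append((int(u), j, 0))
    return np.array(rows)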

The Trainer class contains the core training logic; it supports both pairwise datasets and normal ones.

The Tester class is used to obtain the evaluation metrics from a model or algorithm. It supports the following:

  • Hit Ratio (HR): Measures whether the real test item is in the top positions of the recommendation list
  • Normalized Discounted Cumulative Gain (NDCG): Measures the ranking quality, i.e. where in the ranking the real test item appears
  • Coverage (COV): Measures the fraction of all items that appear in the top-k positions
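
All three can be computed per user from the ranked list of candidate items; a minimal sketch of how they might be evaluated (not the actual Tester code):

import numpy as np

def hit_ratio(ranking: np.ndarray, target: int, k: int = 10) -> float:
    # 1.0 if the real test item appears in the top-k recommendations
    return float(target in ranking[:k])

def ndcg(ranking: np.ndarray, target: int, k: int = 10) -> float:
    # discounts the hit by the position of the real test item in the ranking
    hits = np.where(ranking[:k] == target)[0]
    return 1.0 / np.log2(hits[0] + 2) if len(hits) else 0.0

def coverage(rankings, num_items: int, k: int = 10) -> float:
    # fraction of the whole catalogue that shows up in any top-k list
    recommended = set()
    for ranking in rankings:
        recommended.update(ranking[:k].tolist())
    return len(recommended) / num_items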

The RunTracker class sends the training & testing metrics to TensorBoard and could be extended to send them to other systems.

The Finder class uses the fuzzywuzzy fuzzy string matching library to find items in a pandas DataFrame that match a given string. This is used in the recommend subcommand to look for items without having to remember their ids.

The Recommender class is used in the recommend subcommand to generate a dataset for a given user and run it through a pretrained model to obtain item recommendations.

Hyperparameters

We created a nnrecommend.hparams.HyperParameters class that can load hyperparameters from the command line or from files. Hyperparameters can be specified using --hparam name:value from the command line.
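
For example, to override a couple of hyperparameters directly when training:

nnrecommend --hparam embed_dim:64 --hparam batch_size:1024 train ml-100k --dataset movielens --model fm-linear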

To load the ray tune config when running hyperparameter tuning, we also provide a way of specifying the hyperparameter ranges using json files. This allows us to tune with different values and not have to change the code every time.
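
As a purely illustrative sketch (the exact schema is documented in the usage file), such a tune config could map every hyperparameter to the range of values to explore:

{
    "learning_rate": [0.0001, 0.01],
    "embed_dropout": [0.0, 0.7],
    "batch_size": [256, 512, 1024]
}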

For more information on the specific json formats, please read the usage information section on hyperparameters.

Experiments

Movielens Evaluation

The movielens dataset consists of 100k interactions between 943 users and 1682 movies.

movielens histogram

We found this 2019 paper that lists evaluation metrics for different recommender systems using this dataset. The best two models listed are NeuACF and NeuACF++.

model     hit ratio  ndcg
ItemKNN   0.5891     0.3283
NeuACF    0.6846     0.4068
NeuACF++  0.6915     0.4092

Our hypothesis is that we should be able to reproduce similar metrics and hopefully improve on them using factorization machines with GCN embeddings. We ran our models matching the hyperparameters specified in the paper so that we could compare their results with ours. You can download the dataset from this link and place it under the ml-100k folder.

hparam           value
embed_dim        64
negatives_train  10
negatives_test   99
topk             10

An initial evaluation running with fm-linear shows that we're already better than knn, but still not better than NeuACF or NeuACF++.

nnrecommend --hparams-file hparams/movielens/initial_hparams.json train ml-100k --dataset movielens --model fm-linear --tensorboard tensorboard
nnrecommend --hparams-file hparams/movielens/initial_hparams.json fit ml-100k --dataset movielens --algorithm knn --tensorboard tensorboard

initial results with linear compared to knn and baseline

initial results legend

After enabling pairwise_loss (Bayesian Personalized Ranking loss) and adding previous item context, we achieved better results. We then optimized the model parameters using the nnrecommend tune command.

nnrecommend -v tune ml-100k --dataset movielens --config hparams/movielens/tune_config.json --model fm-linear
nnrecommend -v tune ml-100k --dataset movielens --config hparams/movielens/tune_config.json --model fm-gcn 

The tuned parameters for the training were:

for fm-linear:

parameter      value
learning_rate  0.0005
batch_size     1024
embed_dropout  0.5

for fm-gcn:

parameter      value
learning_rate  0.001
batch_size     1024
embed_dropout  0.4

The json files to load the hyperparameters can be found in the hparams/movielens repository folder; the resulting tensorboard data is in the results/movielens folder.

To obtain the final evaluations we ran the following commands.

nnrecommend --hparams-file hparams/movielens/linear_testset_hparams.json train ml-100k --dataset movielens --model fm-linear --tensorboard tensorboard
nnrecommend --hparams-file hparams/movielens/gcn_testset_hparams.json train ml-100k --dataset movielens --model fm-gcn --tensorboard tensorboard
nnrecommend --hparams-file hparams/movielens/gcn_testset_hparams.json train ml-100k --dataset movielens --model fm-gcn-att --tensorboard tensorboard

comparing fm-linear, fm-gcn and fm-gcn-att without context

comparing fm-linear, fm-gcn and fm-gcn-att with previous item context

movielens legend

We can observe that:

  • our baseline & knn metrics roughly match the paper's
  • our fm-linear matches the paper's FMG results
  • GCN has better results than linear
  • GCN with attention has better results than GCN in terms of coverage
  • adding previous item context improves the metrics

type         model       context  hit ratio  ndcg    coverage
paper        ItemPop              0.3998     0.3998
nnrecommend  baseline             0.4051     0.4051  0.2348
paper        ItemKNN              0.5891     0.3283
nnrecommend  knn                  0.5716     0.3422  0.8062
paper        FMG                  0.6373     0.3588
nnrecommend  fm-linear            0.6458     0.3658  0.5458
nnrecommend  fm-gcn               0.6543     0.3792  0.5856
nnrecommend  fm-gcn-att           0.6596     0.3883  0.6225
paper        NeuACF               0.6846     0.4068
paper        NeuACF++             0.6915     0.4092
nnrecommend  fm-linear   prev     0.7264     0.4453  0.6046
nnrecommend  fm-gcn      prev     0.7370     0.4611  0.6165
nnrecommend  fm-gcn-att  prev     0.7349     0.4581  0.7206

Qualitative User Recommendations

If we store the trained model using the --output parameter, we can then run nnrecommend recommend to obtain new movie recommendations for the users.

As an example, for the user 600 we can run:

nnrecommend --hparams-file hparams/movielens/linear_best_hparams.json train ml-100k --dataset movielens --model fm-linear --output movielens.pth
nnrecommend recommend movielens.pth --label 600 --user-items 3

The result would be:

user 600 rated 148 movies
some movies that the user rated were:
The Godfather (1972)
Strange Days (1995)
Sneakers (1992)
----
highest recommended movies for the user:
The Fugitive (1993)
Toy Story (1995)
When Harry Met Sally... (1989)

To achieve this, the Recommender class generates a dataset with the specified user in the user column and every item in the item column, and runs it through the pretrained model. The items with the highest predicted ratings are the top recommended movies.
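
In essence, the procedure is the following sketch (function and argument names are illustrative, not the actual Recommender API):

import numpy as np
import torch

def recommend_for_user(model: torch.nn.Module, user_id: int, num_users: int,
                       num_items: int, topk: int = 10) -> np.ndarray:
    # pair the user with every item id (item ids come right after user ids,
    # since every dataset column uses consecutive ids)
    item_ids = np.arange(num_users, num_users + num_items)
    rows = np.column_stack([np.full(num_items, user_id), item_ids])
    with torch.no_grad():
        predictions = model(torch.from_numpy(rows))
    # the items with the highest predicted ratings are the recommendations
    best = torch.topk(predictions, topk).indices.numpy()
    return item_ids[best]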

Spotify Evaluation

Now that our models have matched the results of the paper on the movielens dataset, we will try them out on another dataset: the Spotify sessions and songs that were released for the sequential skip prediction challenge. The dataset can be downloaded from here.

Our hypothesis is that we can train our models on this data and obtain results similar to the movielens ones. In addition, since the dataset includes a lot of metadata, we want to add other context columns and evaluate whether they improve the metrics.

The full dataset is huge (more than 200GB uncompressed) and contains 50M sessions and 2.7M songs, but they also provide a miniset with a size better suited for our tests. After analyzing it, though, we found that the distribution of the amount of items per user was too sparse for our use case. We then extracted a new mini dataset with a histogram more similar to that of the full one. You can download it from here.

spotify histogram

When running the models on the Spotify dataset without context we were achieving very good results, so we switched to negatives_test=-1, meaning that we add all the negative interactions when evaluating the metrics.

After tuning the hyperparameters we came up with the following values:

for fm-linear:

parameter      value
learning_rate  0.001
batch_size     1024
embed_dropout  0.5

for fm-gcn:

parameter      value
learning_rate  0.01
batch_size     1024
embed_dropout  0.3

To obtain the final evaluations we ran the following commands.

nnrecommend --hparams-file hparams/spotify/linear_testset_hparams.json train results/spotify/mini-spotify-data.csv --dataset spotify --model fm-linear --tensorboard tensorboard
nnrecommend --hparams-file hparams/spotify/gcn_testset_hparams.json train results/spotify/mini-spotify-data.csv --dataset spotify --model fm-gcn --tensorboard tensorboard

Since this dataset contains around 10 times more users and items than the movielens one, we were not able to run the training using GCN with attention with the same hyperparameters; we were getting CUDA out-of-memory errors on an 8GB GPU. We had to reduce them a bit, but we still wanted to see if the results would be better.

for fm-gcn-att:

parameter              value
embed_dim              60
learning_rate          0.01
graph_attention_heads  1

nnrecommend --hparams-file hparams/spotify/gcn_att_testset_hparams.json train results/spotify/mini-spotify-data.csv --dataset spotify --model fm-gcn-att --tensorboard tensorboard

We can see that without context, GCN (pink) is a bit better than linear (orange), and GCN with attention (red) improves on both, even with fewer hidden dimensions.

spotify without context

With previous item context all the models improve, and GCN (green) still gives the best results.

spotify with previous context

Other possible contexts

Since this dataset has much more metadata, we evaluated the following contexts:

  • skipped: if the user skipped the song at the beginning, middle, end or not at all
  • count: count how many times the song was played by the user

We added a context column with 1 if the user did not skip the song and played it multiple times, but the evaluation metrics started out worse than without any context and tended towards the same values in the end, so this change did not work. Using other dataset contexts like hour of day or song release year produced similar results. We assume this behaviour is due to the distribution of these contexts not being uniform enough, and to the fact that they don't seem to correlate with the actual interactions between users and items.

spotify with skip context

Addressing The Cold Start Problem

We found that our models were able to generate recommendations for existing users, but since this is a collaborative filtering recommender system, it suffers from the cold start problem, i.e. it cannot recommend items to users that are not in the dataset.

To address this issue we propose a modification of the InteractionDataset, sketched in code below, that goes as follows:

  • remove the user column
  • compute previous item context column
  • remove all the rows with -1 previous item (no previous item)
  • switch the item column with the previous item context column

recommend dataset algorithm
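
A minimal numpy sketch of this transformation, assuming rows of (user id, item id) sorted by user and interaction order (the function name is illustrative):

import numpy as np

def to_item_item_dataset(interactions: np.ndarray) -> np.ndarray:
    users, items = interactions[:, 0], interactions[:, 1]
    # previous item context: the item the same user interacted with before
    prev = np.roll(items, 1)
    prev[0] = -1
    prev[users != np.roll(users, 1)] = -1  # first row of each user has none
    # drop rows without a previous item, then keep (previous item, item)
    # pairs: the user column is gone and the previous item takes its place
    keep = prev != -1
    return np.column_stack([prev[keep], items[keep]])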

With this change we have a dataset where, given a previous item, the factorization machine should return a rating for the next item. Our hypothesis is that we can use this modified dataset to train the same factorization machine model to recommend items to new users that only need to provide one item they like.

To evaluate this hypothesis we can use the same metrics as for the normal factorization machine; we just group by previous item instead of by user. That said, to test the trained model and obtain some sample recommendations, it is best to have a dataset that contains item metadata. Therefore we test with the movielens dataset, since the spotify one does not contain song names.

We train a model using the recommend hyperparameter to enable this new functionality.

nnrecommend --hparam recommend:1 --hparam interaction_context: train data/ml-100k --dataset movielens --output movielens_recommend.pth

As we can see, the recommender evaluation metrics improve during training, which means that we're training the model correctly. We can then use the nnrecommend recommend command to get movie recommendations.

nnrecommend recommend movielens_recommend.pth --label "star wars"

item recommend screenshot

When asking for recommendations similar to Star Wars we get Return Of The Jedi as the second highest ranked item, which seems like a good recommendation.

Conclusions

We managed to reproduce the recommender system metrics shown in this 2019 paper using the movielens dataset.

We managed to extract a subset of the spotify dataset that could be used to train recommender systems, and obtained good results with it. On the other hand, we were not able to find additional context data that improved them further.

We implemented Bayesian Personalized Ranking loss and previous item context and showed that they improve the Factorization Machine evaluation metrics.

We showed that using Graph Convolutional Networks for the embedding of a Factorization Machine leads to better results.

We proposed a solution that adapts the factorization machine model to deal with the cold start problem.

Future Work

To get better results we could implement the neural version of the factorization machine as explained in this paper.

The current model could be extended to add user & item features to the graph convolutional network; right now the feature matrix provided to the GCN is always an identity matrix.

The item recommendation algorithm could be extended to support multiple previous item columns, so that recommendations would be more accurate when a new user provides several items they liked.

We could evaluate adding other interaction contexts that can be found in the spotify dataset and see how they affect the different models.

We could implement a frontend for the recommend subcommand to get recommendations via web, for example using the flask framework.

Bibliography