RME

This repo contains the source code for our paper "Regularizing Matrix Factorization with User and Item Embeddings for Recommendation", published in CIKM 2018. The implementation is multi-threaded, so it runs fast on large datasets.

DATA FORMAT

Each input CSV follows this format:

  • First line: the header "userId,movieId"
  • Remaining lines: one [userId],[movieId] pair per line
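For example, the first few lines of train.csv could look like this (the IDs are illustrative):

userId,movieId
1,318
1,593
2,260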

Dataset for running our source code: ml10m (MovieLens 10M).

We preprocessed it and split it into train, validation/dev, and test sets. Their paths are:

  • data/ml10m/train.csv

  • data/ml10m/test.csv

  • data/ml10m/validation.csv

The file of users and their disliked items uses the same format:

  • First line: the header "userId,movieId"
  • Remaining lines: one [userId],[movieId] pair per line

When users' disliked items are already available, do two steps:

  • save them to data/ml10m/train_neg.csv (see the example after this list)
  • build the disliked item-item co-occurrence matrix by running (assuming the dataset is ml10m): python produce_negative_cooccurrence.py --dataset ml10m
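For example, a minimal way to write known disliked pairs in the expected format (the IDs are illustrative; any method that produces this CSV works):

import pandas as pd

# illustrative user-disliked-item pairs; column names must match the header above
disliked = pd.DataFrame({'userId': [1, 1, 2], 'movieId': [318, 593, 260]})
disliked.to_csv('data/ml10m/train_neg.csv', index=False)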

RUNNING:

Step 1.1: produce the user-user and item-item co-occurrence matrices:

python produce_positive_cooccurrence.py --dataset ml10m
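For intuition, here is a minimal single-threaded sketch of item-item co-occurrence counting under the CSV format above. It is an illustration only; the repo's multi-threaded produce_positive_cooccurrence.py may differ in details (user-user counts, sparse output, SPPMI conversion):

from collections import Counter
from itertools import combinations

import pandas as pd

def item_cooccurrence(csv_path):
    # two items co-occur once for every user who consumed both
    df = pd.read_csv(csv_path)  # columns: userId, movieId
    counts = Counter()
    for _, items in df.groupby('userId')['movieId']:
        for i, j in combinations(sorted(set(items)), 2):
            counts[(i, j)] += 1
    return counts

cooc = item_cooccurrence('data/ml10m/train.csv')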

Step 1.2: produce the negative item-item co-occurrence matrix (only if disliked items are available; if not, that's fine, disliked items will be inferred in Step 2):

python produce_negative_cooccurrence.py --dataset ml10m

Step 2.1: run RME with available disliked items (if you already ran Step 1.2):

python rme_rec.py --dataset ml10m --model rme --neg_item_inference 0 --n_factors 40 --reg 1.0 --reg_embed 1.0

Step 2.2: run RME with our user-oriented EM-like algorithm to infer disliked items for users (if disliked items are not available and you could not run Step 1.2):

python rme_rec.py --dataset ml10m --model rme --neg_item_inference 1 --n_factors 40 --reg 1.0 --reg_embed 1.0

where:

  • model: the model to run. There are 3 choices: rme (our model), wmf, and cofactor.
  • reg: the regularization hyper-parameter for user and item latent factors (alpha and beta).
  • reg_embed: the regularization hyper-parameter for user and item context latent factors (gamma, theta, delta).
  • n_factors: the number of latent factors (i.e., the embedding size). Default: 40.
  • neg_item_inference: whether or not to run our user-oriented EM-like algorithm to sample disliked items for users. If user-disliked-item pairs are already available, set this to 0.
  • neg_sample_ratio: the negative sample ratio per user. If a user consumed 10 items and neg_sample_ratio = 0.2, 2 negative items are randomly sampled for that user (see the sketch after this list). Default: 0.2.
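To make the neg_sample_ratio behavior concrete, here is a hedged sketch of uniform per-user sampling; sample_negatives is a hypothetical helper, not the repo's EM-like procedure, which infers disliked items rather than sampling uniformly:

import numpy as np

def sample_negatives(consumed, n_items, neg_sample_ratio=0.2, seed=0):
    # candidates are items the user has not consumed, indexed 0..n_items-1
    rng = np.random.default_rng(seed)
    candidates = np.setdiff1d(np.arange(n_items), np.fromiter(consumed, dtype=int))
    n_neg = int(round(len(consumed) * neg_sample_ratio))  # 10 items * 0.2 -> 2 negatives
    return rng.choice(candidates, size=min(n_neg, len(candidates)), replace=False)

print(sample_negatives({1, 5, 7, 12, 30, 31, 42, 50, 61, 99}, n_items=1000))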

other hyper-parameters:

  • s: the shift constant, a hyper-parameter controlling the density of the SPPMI matrix (see the sketch after this list). Default: s = 1.
  • data_path: path to the data. Default: data.
  • saved_model_path: path to save the optimal model selected on the validation/development set. Default: MODELS.
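For intuition about s, below is a dense NumPy sketch of the shifted positive PMI (SPPMI) transform that this family of models applies to the co-occurrence counts; a larger s subtracts more, zeroing out more entries and making the matrix sparser. It is an illustration (assuming every item appears in at least one co-occurrence pair), not the repo's implementation:

import numpy as np

def sppmi(cooc, s=1.0):
    # cooc: dense, symmetric item-item co-occurrence count matrix
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide='ignore'):
        # PMI(i, j) = log( #(i, j) * D / (#(i) * #(j)) ); zero counts give -inf
        pmi = np.log(cooc * total) - np.log(row * col)
    # shift by log(s) and clip at zero; with s = 1 this is plain positive PMI
    return np.maximum(pmi - np.log(s), 0.0)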

You may get some results like:

top-5 results: recall@5 = 0.1559, ndcg@5 = 0.1613, map@5 = 0.1076
top-10 results: recall@10 = 0.1513, ndcg@10 = 0.1547, map@10 = 0.0851
top-20 results: recall@20 = 0.1477, ndcg@20 = 0.1473, map@20 = 0.0669
top-50 results: recall@50 = 0.1819, ndcg@50 = 0.1553, map@50 = 0.0562
top-100 results: recall@100 = 0.2533, ndcg@100 = 0.1825, map@100 = 0.0579

Running the baselines (Cofactor and WMF):

  • Running cofactor:

python rme_rec.py --dataset ml10m --model cofactor --n_factors 40 --reg 1.0 --reg_embed 1.0

You may get results like:

top-5 results: recall@5 = 0.1522, ndcg@5 = 0.1537, map@5 = 0.1000
top-10 results: recall@10 = 0.1383, ndcg@10 = 0.1425, map@10 = 0.0756
top-20 results: recall@20 = 0.1438, ndcg@20 = 0.1391, map@20 = 0.0606
top-50 results: recall@50 = 0.1762, ndcg@50 = 0.1484, map@50 = 0.0518
top-100 results: recall@100 = 0.2545, ndcg@100 = 0.1783, map@100 = 0.0540

  • Running WMF:

python rme_rec.py --dataset ml10m --model wmf --n_factors 40 --reg 1.0 --reg_embed 1.0

You may get results like:

top-5 results: recall@5 = 0.1258, ndcg@5 = 0.1283, map@5 = 0.0810
top-10 results: recall@10 = 0.1209, ndcg@10 = 0.1231, map@10 = 0.0624
top-20 results: recall@20 = 0.1290, ndcg@20 = 0.1230, map@20 = 0.0507
top-50 results: recall@50 = 0.1641, ndcg@50 = 0.1349, map@50 = 0.0442
top-100 results: recall@100 = 0.2375, ndcg@100 = 0.1640, map@100 = 0.0470

CITATION:

If you use our method (RME) in your work, please cite our paper:

@inproceedings{tran2018regularizing,
  title={Regularizing Matrix Factorization with User and Item Embeddings for Recommendation},
  author={Tran, Thanh and Lee, Kyumin and Liao, Yiming and Lee, Dongwon},
  booktitle={Proceedings of the 27th ACM International Conference on Information and Knowledge Management},
  pages={687--696},
  year={2018},
  organization={ACM}
}