Hybrid Collaborative Filtering with Neural Networks

Collaborative Fltering uses the ratings history of users and items. The feedback of one user on some items is combined with the feedback of all other users on all items to predict a new rating. For instance, if someone rated a few books, Collaborative Filtering aims at estimating the ratings he would have given to thousands of other books by using the ratings of all the other readers.

The following module tackles Collaborative Filtering by using sparse denoising autoencoders.

More information can be found in those papers

[TEMPO] A step-by-step tutorial is available here . It will be pushed soon :)

Dependencies:

  • torch
  • nn
  • xlua
  • nnsparse
  • optim

(optional) anaconda2

Summary

git clone git@github.com:fstrub95/Autoencoders_cf.git
cd Autoencoders_cf
cd data
wget http://files.grouplens.org/datasets/movielens/ml-10m.zip
unzip ml-10m.zip 
cd ../src
th data.lua  -ratings ../data/ml-10M100K/ratings.dat -metaItem ../data/ml-10M100K/movies.dat -out ../data/ml-10M100K/movieLens-10M.t7 -fileType movieLens -ratio 0.9
th main.lua  -file ../data/ml-10M100K/movieLens-10M.t7 -conf ../conf/conf.movieLens.10M.V.lua  -save network.t7 -type V -meta 1 -gpu 1
th computeMetrics.lua -file ../data/ml-10M100K/movieLens-10M.t7 -network network.t7 -type V -gpu 1

Your network is ready!

(Average time ~25min)

Step 1 : Convert the dataset

th data.lua  -xargs

This script will turn an external raw dataset into torch format. The dataset will be split into a training/testing set by using the training ratio. When side inforamtion exist, they are automatically appended to the inputs. The MovieLens and Douban dataset are supported by default.

Options
  -ratings  [compulsary] The relative path to your data file
  -metaUser The relative path to your metadata file for users 
  -metaItem The relative path to your metadata file for items 
  -tags     The relative path to your tag file 
  -fileType [compulsary] The data file format (movieLens/douban/classic) 
  -out      [compulsary] The data file format (movieLens/douban/classic)
  -ratio    [compulsary] The training ratio 
  -seed     seed 

Example:

th data.lua  -ratings ../data/movieLens-10M/ratings.dat -metaItem ../data/movieLens-10M/movies.dat -out ../data/movieLens-10M/movieLens-10M.t7 -fileType movieLens -ratio 0.9

For information, the datasets contains the following side information

Dataset user info item info item tags
MovieLens-1M true true false
MovieLens-10M false true true
MovieLens-20M false true true
Douban true info false

To compute tags, please use the script sparsesvd.py : sparsesvd.py [in] [out] [rank]

Example:

python2 sparsesvd.py ml-10M100K/tags.dat ml-10M100K/tags.dense.csv 50
th data.lua -xargs ... -tags ml-10M100K/tags.dense.csv

If you have want to use external data (for benchmarking purpose), please use the Classic mode. The classic mode takes up to four file as input:

  • training ratings
  • testing ratings
  • user side information
  • item side information

Training/Testing : You have to create two files:

  • [fileName].train
  • [fileName].test and provide the following argument to the scrit data.lua
ls dataset*
dataset.txt.train
dataset.txt.test
th data.lua -ratings dataset.txt

Please use the following format for the training/testing datasets: [idUser] [idItem] [rating]

  • idUser > 0 (id must start at 1)
  • idItem > 0
  • rating \in [-1;1]

Example:

1 2 0.31
2 3 0.5
1 5 -0.1

NB If your ratings are not included in [-1,1], you can modify the function preprocessing() in data/ClassicLoader.lua For instance, if the ratings are included in [1-5], use: preprocessing(x) return (x-3)/2 end

Side information : You can create two files:

  • [userFileName].txt
  • [itemFileName].txt
ls dataset*
dataset.txt.train
dataset.txt.test
th data.lua -ratings [fileName] -metaUser [userFileName].txt -metaItem [itemFileName].txt

Please use the following format for the side information datasets:

  • user side info : [idUser] [noInfo] [idUserInfo]:[value] [idUserInfo]:[value] ...
  • user item info : [idItem] [noInfo] [idItemInfo]:[value] [idItemInfo]:[value] ...

where

  • idUser/idItem > 0 (id must correspond to the training/testing datasets)
  • idUserInfo/idItemInfo > 0 (id must start at 1)
  • value \in [-1;1] Example:
1 2 5:0.31 12:-1
2 0
1 3 5:0.28 4:1 12:0.5

Step 2 : Train the Network

th main.lua  -xargs

You can either train a U-Autoencoders/V-Autoencoders. Both will compute a final matrix of ratings. Yet, U-encoders will mainly learn a representation of users while V-Autoencoders will mainly learn representation of items. Training a network requires to use an external configuration file (cf further for more explanation regarding this file). Basic configuration files are provided for both MovieLens and Douban datasets.

Options
  -file [compulsary] The relative path to your data file (torch format). Please use data.lua to create such file.
  -conf [compulsary] The relative path to the lua configuration file
  -seed The seed. random = 0
  -meta [compulsary] use metadata false=0, true=1
  -type [compulsary] Pick either the U/V Autoencoder. 
  -gpu  [compulsary] use gpu. CPU = 0, GPU > 0 with GPU the index of the device
  -save Store the final network in an external file 

Example:

th main.lua  -file ../data/movieLens-10M/movieLens-10M.t7 -conf ../conf/conf.movieLens.10M.V.lua  -save network.t7 -type V -meta 1 -gpu 1

NB: Saving the network let you use it for recommendation tasks.

You can configure the network architecture and training by modifying the file config.template.lua it has the following structure:

local config = 
{
   layer1 = 
   {
      layerSize = 100,    
    { Training 1 }
   },
   layer2 =
   {
     layerSize = 50,    
     { Training 1 },  --inner hidden layers
     { Training 2 },  --final network
    },
    layer3 =
   { 
     layerSize = 20,    
    { Training 1 }, -- inner hidden layers
    { Training 2 }, -- intermediate hidden layers
    { Training 3 }, -- final network
    }
    etc.
}
return config

Autoencoders are iteratively trained, stacked and fine-tuned.

"Training" is defined as follow:

{
   noEpoch = 15,             -- number of epoch to train the layer
   miniBatchSize = 35,       -- minibatch size 
   learningRate = 0.02,      -- Learning rate
   learningRateDecay = 0.5,  -- Learning rate decay lrt = lrt / (1+lrt_dec)   
   weightDecay = 0.03,       -- L2 regulizer
   criterion = cfn.SDAECriterionGPU(nn.MSECriterion(),  -- define the training loss
   {
      alpha = 1,        -- prediction hyperparameter 
      beta  = 0.5,      -- reconstruction hyperparameter
      hideRatio = 0.2,  -- Maksing noise ratio
   }), 
}

Step 3 : Recommender System

Once the network is trained, it is possible to use it as a recommender system. For now, it is possible to compute the RMSE by sorting the users/items regarding their number of ratings.

Further work will enable to directly suggest items to users (or users to items!)

Benchmarks

The SVD and ALS-WR algorithms are provided for benchmarking for medium size datasets. For bigger datasets, we adivese to use mahout

  • ALS-WR :
th ALS.lua  -xargs

-file         The relative path to your data file.              
-lambda       Rank of the final matrix                             
-rank         Regularisation                                      
-seed         The random seed                                   
  • Gradient :
th GradDescent.lua  -xargs

-file         The relative path to your data file.              
-lambda       Rank of the final matrix                         
-rank         Regularisation                                     
-lrt          Learning Rate                                    
-seed         The random seed