Last edit : 29 July 2019
Recommender system referencing KPRN and its original GitHub implementation, trained on a custom MovieLens-20M dataset.
The implemented model differs slightly from the original: no pooling layer is added after the LSTM.
Given a path between a user and an item, the model predicts how likely the user is to interact with that item.
- `/cache`: temporary files used in training
- `/data`: contains the dataset used in training (a custom ml-20m dataset that keeps only the movies appearing in Ripple-Net's knowledge graph)
- `/log`: contains training results, each stored in a single folder named after the training timestamp
- `/test`: contains the Jupyter notebooks used for testing the trained models
- `KPRN-LSTM.ipynb`: notebook to train the model
- `main.py`: Python 3 version of `KPRN-LSTM.ipynb`
*italic* means this folder is omitted from git, but is necessary if you need to run experiments
**bold** means this folder has its own README, check it for detailed information :)
pip3 install -r requirements.txt
- Unzip `ratings_re.zip` and `ratings_re.z01` in `/data`
- To preprocess, run the `Preprocess.ipynb` notebook or `preprocess.py`
  `python3 data/preprocess.py`
- To train, run the `KPRN-LSTM.ipynb` notebook or `main.py`
  `python3 main.py`
To start using a new dataset, or to generate a new one, delete all items inside `/cache`.
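Emptying the cache can also be scripted. The following is a minimal sketch, assuming the cache lives in a `cache` directory next to the script; the helper name `clear_cache` is hypothetical and not part of this repository.

```python
import shutil
from pathlib import Path


def clear_cache(cache_dir="cache"):
    """Delete every entry inside cache_dir but keep the directory itself.

    Returns the number of top-level entries removed.
    """
    cache_path = Path(cache_dir)
    if not cache_path.is_dir():
        return 0  # nothing to clear

    removed = 0
    for entry in cache_path.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove sub-folder and its contents
        else:
            entry.unlink()  # remove file or symlink
        removed += 1
    return removed
```

Calling `clear_cache("cache")` before preprocessing forces the temporary files to be regenerated for the new dataset.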
Open `KPRN-LSTM.ipynb` or `main.py` and change the model parameters.
- Find the training result folder inside `/log` (find the latest) and copy the folder name.
- Create a copy of the latest Jupyter notebook inside the `/test` folder.
- Rename the folder to match a folder in `/log` (for traceability).
- Replace `TESTING_CODE` at the top of the notebook.
- Run the notebook.
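The first step above can be automated. This is a small sketch, assuming the newest training run is the most recently modified sub-folder of `log`; the helper name `latest_run_folder` is hypothetical.

```python
from pathlib import Path


def latest_run_folder(log_dir="log"):
    """Return the name of the most recently modified sub-folder, or None."""
    runs = [p for p in Path(log_dir).iterdir() if p.is_dir()]
    if not runs:
        return None
    # Modification time is used because the exact timestamp format of the
    # folder names is not assumed here.
    return max(runs, key=lambda p: p.stat().st_mtime).name
```

The returned name is the value to paste in place of `TESTING_CODE`, assuming that placeholder is meant to hold the folder name.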
Evaluation size | Prec@10 | Distinct@10 | Unique items |
---|---|---|---|
Eval on 10 users | 0.12028 | 0.70000 | 70 |
Eval on 30 users | 0.16667 | 0.60667 | 182 |
Eval on 100 users | 0.17471 | 0.38600 | 386 |

Evaluation size | Prec@10 | Distinct@10 | Unique items |
---|---|---|---|
Eval on 10 users | 0.20000 | 0.32000 | 32 |
Eval on 30 users | 0.24333 | 0.21000 | 63 |
Eval on 100 users | 0.25864 | 0.13400 | 134 |

Evaluation size | Prec@10 | Distinct@10 | Unique items |
---|---|---|---|
Eval on 10 users | 0.18667 | 0.31000 | 31 |
Eval on 30 users | 0.18000 | 0.14667 | 44 |
Eval on 100 users | 0.23453 | 0.08400 | 84 |
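For reference, the metrics in these tables can be computed along the following lines. The Distinct@10 column matches unique items divided by (users x 10), for example 70 / (10 x 10) = 0.70. The Prec@10 definition used here (hits in the top 10, divided by 10, averaged over users) is the standard one and is an assumption about how the notebooks compute it; all function names are hypothetical.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def evaluate(recommendations, ground_truth, k=10):
    """Compute mean Prec@k, Distinct@k and the unique item count.

    recommendations: {user: ranked list of items}
    ground_truth:    {user: set of items the user actually interacted with}
    """
    users = list(recommendations)
    if not users:
        raise ValueError("no users to evaluate")

    precisions = [
        precision_at_k(recommendations[u], ground_truth.get(u, set()), k)
        for u in users
    ]
    # Items that appear in at least one user's top-k list.
    unique_items = {item for u in users for item in recommendations[u][:k]}

    return {
        "prec@k": sum(precisions) / len(users),
        "distinct@k": len(unique_items) / (len(users) * k),
        "unique_items": len(unique_items),
    }
```

A low Distinct@k means many users receive the same items, so it is a rough proxy for how personalized the suggestions are.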
KPRN relies heavily on paths, which are handcrafted from the knowledge graph and sampled from hundreds of millions of possible paths.
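As a rough illustration of this kind of path generation, the sketch below enumerates paths of the form user -> seed item -> entity -> suggested item and downsamples twice: once over seeds and once over the paths of each seed. The data layout (two plain dictionaries), the function name and the default limits are assumptions for illustration, not the repository's actual preprocessing code.

```python
import random


def generate_paths(user, seed_items, item_to_entities, entity_to_items,
                   max_seeds=5, max_paths_per_seed=10, rng=None):
    """Enumerate and sample (user, seed, relation, entity, item) paths.

    item_to_entities: {item: [(relation, entity), ...]}
    entity_to_items:  {entity: [item, ...]}
    """
    rng = rng or random.Random(0)

    # First sampling step: limit how many history items act as seeds.
    seeds = list(seed_items)
    if len(seeds) > max_seeds:
        seeds = rng.sample(seeds, max_seeds)

    paths = []
    for seed in seeds:
        candidates = [
            (user, seed, relation, entity, item)
            for relation, entity in item_to_entities.get(seed, [])
            for item in entity_to_items.get(entity, [])
            if item != seed  # a path must end at a different item
        ]
        # Second sampling step: limit the paths kept per seed.
        if len(candidates) > max_paths_per_seed:
            candidates = rng.sample(candidates, max_paths_per_seed)
        paths.extend(candidates)
    return paths


# Toy knowledge graph, for illustration only.
item_to_entities = {"Castle on the Hill": [("same_artist", "Ed Sheeran")]}
entity_to_items = {"Ed Sheeran": ["Castle on the Hill", "Perfect", "Shape of You"]}

paths = generate_paths("u1", ["Castle on the Hill"],
                       item_to_entities, entity_to_items)
```

Every limit applied here discards candidate paths, which is where the sampling bias discussed in this section comes from.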
- Paths have the form (user -> seed item (e.g. Castle on The Hill) -> entity (e.g. Ed Sheeran) -> suggestion (e.g. Perfect)). Without sampling, each seed yields millions of paths; even when sampling with only one relation per seed (e.g. same artist, same album), a seed still generates around 8k-10k paths.
- Each user has multiple items that act as seeds (typically 20+), so the seeds need to be sampled again to reduce the number of generated paths and the computational cost.
- We make sure each suggested item has about 4-7 paths.
- At this point, we only evaluate around 75-150 paths per user, out of hundreds of millions of possible paths.
- That is a large potential source of sampling bias, but it is practically impossible to search through all paths.
- Judging from the KPRN results, using a KG looks quite promising, especially for improving the diversity of suggestions.
- The downside of KPRN is that the results rely heavily on 'handcrafted' paths, which undergo many downsampling steps.
- Summary compared to non-KG RecSys: big improvement in terms of Prec@k and distinct rate.
- Able to incorporate Knowledge Graph as another source of information
- Able to infer why a user is given such suggestions (based on path scores)
- Able to adjust between 'exploration and optimization' by applying result pooling (the model does not need to be retrained)
- During training, the model converges really fast (< 10 epochs)
- No usable original implementation
- Relies heavily on paths, which are 'handcrafted' from the knowledge graph and sampled from hundreds of millions of possible paths.
- Large potential sampling bias introduced by the preprocessing and path generation steps.
- The model memorizes users, so it needs to be retrained every time a new user or item is added.
- Requires relatively slow preprocessing
- Very slow training and prediction time
- The loss function and metric used in training are not Prec@K; training uses accuracy instead.
- At the cost of a slightly different implementation, it is easier to implement with a high-level library such as Keras than with the original version.
- Path generation and prediction time: about 4k paths / second
- Different sampling methods and sampling parameters have an insignificant effect
- Using more items as path generation 'seeds' (for predicting suggestions) should lead to more personalized suggestions (i.e. the suggestions take more of the user's history into account).
- By pooling the path prediction scores for the same item, the model should be able to give better suggestions, since it considers multiple reasons instead of just a single reason.
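The result pooling mentioned in the lists above can be sketched as follows: path scores that end at the same item are grouped and combined into one score per item, then items are ranked. The function names and the choice of pooling methods are assumptions for illustration; the log-sum-exp variant with temperature `gamma` is modeled on the weighted pooling idea from KPRN and is not taken from this repository's code.

```python
import math
from collections import defaultdict


def pool_scores(path_scores, method="mean", gamma=1.0):
    """Combine per-path scores into one score per item.

    path_scores: iterable of (item, score) pairs, one pair per path.
    """
    by_item = defaultdict(list)
    for item, score in path_scores:
        by_item[item].append(score)

    pooled = {}
    for item, scores in by_item.items():
        if method == "max":
            pooled[item] = max(scores)  # single best reason
        elif method == "mean":
            pooled[item] = sum(scores) / len(scores)
        elif method == "logsumexp":
            # Numerically stable log(sum(exp(score / gamma))).
            scaled = [s / gamma for s in scores]
            m = max(scaled)
            pooled[item] = m + math.log(sum(math.exp(s - m) for s in scaled))
        else:
            raise ValueError(f"unknown pooling method: {method}")
    return pooled


def top_k(pooled, k=10):
    """Return the k items with the highest pooled score."""
    return sorted(pooled, key=pooled.get, reverse=True)[:k]


scores = [("a", 0.9), ("a", 0.1), ("b", 0.6), ("b", 0.6)]
best_by_max = top_k(pool_scores(scores, "max"), k=1)
best_by_mean = top_k(pool_scores(scores, "mean"), k=1)
```

In this toy input, max pooling ranks item `a` first because of its single strong path, while mean pooling prefers item `b`, which is supported consistently by both of its paths. Switching the pooling method changes the ranking without retraining the model.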
- Jessin Donnyson - jessinra@gmail.com
- Michael Julio - michael.julio@gdplabs.id
- Fallon Candra - fallon.candra@gdplabs.id