/CTMP-ThesisProject

Master Thesis @ Charles University - "Probabilistic Models for Recommender Systems"


Master Thesis - "Probabilistic models for Recommender Systems"

Collaborative Topic Model for Poisson distributed ratings (CTMP) with the application of Online Maximum a Posteriori Estimation with Bernoulli randomness (BOPE).

CTMP is a hybrid, interpretable probabilistic model for recommender systems that combines content-based and collaborative filtering. It represents item content with an admixture topic model, Latent Dirichlet Allocation (LDA), and gains computational efficiency from the assumption of Poisson distributed ratings. Both parts live in one tightly coupled probabilistic model, which addresses the limitations of previous methods. The paper was released in April 2018 and is among the recent approaches to commercial product recommendation (movies, documents, scientific articles).

BOPE is an inference method for MAP problems that are non-convex and intractable. It has a fast convergence rate and provides implicit regularization. The paper was released in May 2020, making it one of the most recent MAP estimation methods.
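To give a rough feel for this kind of update, here is a minimal sketch of a BOPE-style step for inferring one document's topic proportions on the simplex. It is not the code from this repository. It assumes the objective splits into a log-likelihood part and a log-prior part, that one of the two is sampled at each iteration with a Bernoulli(p) draw and rescaled by 1/p or 1/(1-p), and that the iterate moves toward the best simplex vertex in a Frank-Wolfe manner. The function name `bope_sketch`, the parameters `alpha` and `p`, the step size `1/(t+1)`, and the toy `beta` and `counts` are all illustrative assumptions.

```python
import numpy as np

def bope_sketch(counts, beta, alpha=0.5, p=0.5, n_iter=50, seed=0):
    """Hypothetical BOPE-style MAP estimate of topic proportions theta.

    counts: (V,) word counts of one document
    beta:   (K, V) topic-word probabilities
    alpha:  Dirichlet hyperparameter (assumed symmetric)
    p:      Bernoulli parameter, must lie strictly between 0 and 1
    """
    rng = np.random.default_rng(seed)
    n_topics = beta.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)  # start at the simplex centre
    n_lik, n_prior = 0, 0                      # how often each part was drawn

    for t in range(1, n_iter + 1):
        # Bernoulli draw: pick the likelihood part with probability p.
        if rng.random() < p:
            n_lik += 1
        else:
            n_prior += 1

        # Gradients of the two parts at the current theta.
        word_probs = theta @ beta                   # (V,)
        grad_lik = beta @ (counts / word_probs)     # d/dtheta of sum_j d_j log(theta . beta_j)
        grad_prior = (alpha - 1.0) / theta          # d/dtheta of (alpha - 1) sum_k log theta_k

        # Gradient of the running average of the sampled, rescaled parts.
        grad = ((n_lik / p) * grad_lik + (n_prior / (1.0 - p)) * grad_prior) / t

        # Frank-Wolfe step toward the best vertex; 1/(t+1) keeps theta strictly positive.
        k = int(np.argmax(grad))
        step = 1.0 / (t + 1)
        theta *= 1.0 - step
        theta[k] += step

    return theta

# Toy example: three topics over three words, document dominated by word 0.
beta = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.1, 0.8]])
counts = np.array([10.0, 1.0, 1.0])
theta = bope_sketch(counts, beta)
```

In this toy setting the estimate concentrates on the first topic, which matches the document's dominant word.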

I implemented the CTMP model augmented with BOPE from scratch in Python and studied its behaviour on the MovieLens 20M and Netflix datasets for movie recommendation. Experiments evaluate the model on Recall, Precision, Perplexity, Sparsity, Topic Interpretation and Transfer Learning between datasets. For more details, please refer to the paper in this link.
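As an illustration of the first two metrics, the sketch below computes precision and recall over the top-k recommended items for a single user. It is a generic formulation and not necessarily what `Evaluation.py` does. The function name `precision_recall_at_k` and the example scores are hypothetical.

```python
import numpy as np

def precision_recall_at_k(scores, relevant, k):
    """Precision and recall over the k highest-scored items for one user.

    scores:   (n_items,) predicted scores, higher means more recommended
    relevant: set of item indices the user actually liked (held-out data)
    """
    top_k = np.argsort(-scores)[:k]                        # indices of the k best items
    hits = sum(1 for item in top_k if int(item) in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0     # avoid division by zero
    return precision, recall

# Made-up example: 5 items, 3 of them relevant, top-3 list evaluated.
scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])
relevant = {0, 3, 4}
precision, recall = precision_recall_at_k(scores, relevant, k=3)
```

Here the top-3 list contains two of the three relevant items, so both values come out as 2/3. In practice these numbers are averaged over all test users.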

Technologies: Python (NumPy, SciPy, Numba, pandas, Matplotlib, NLTK), SQL, Google Cloud

Most attention is paid to the time and space complexity of the model, so scientific computing libraries such as NumPy and SciPy are used for most operations, along with Numba, which speeds up NumPy-heavy functions through JIT (just-in-time) compilation and parallelization. Once the implementation was complete, the model was deployed to a Google Cloud virtual machine with high-performance CPUs, since NumPy and SciPy rely on BLAS, a set of optimized linear algebra routines that run on the CPU.
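The kind of speed-up described here can be sketched with Numba's `njit(parallel=True)` decorator and `prange`. The example is illustrative only and is not taken from this repository. The function `pairwise_scores` and its inputs are hypothetical, and the `ImportError` fallback is there only so the snippet still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback: run as plain Python when Numba is unavailable.
    prange = range

    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap

@njit(parallel=True)
def pairwise_scores(user_factors, item_topics):
    """Dot product of every user vector with every item vector."""
    n_users, n_topics = user_factors.shape
    n_items = item_topics.shape[0]
    scores = np.empty((n_users, n_items))
    for u in prange(n_users):            # outer loop is split across CPU threads
        for j in range(n_items):
            acc = 0.0
            for k in range(n_topics):
                acc += user_factors[u, k] * item_topics[j, k]
            scores[u, j] = acc
    return scores

# Small random inputs: 4 users, 3 items, 5 topics.
rng = np.random.default_rng(0)
users = rng.random((4, 5))
items = rng.random((3, 5))
scores = pairwise_scores(users, items)
```

The first call includes compilation time, so timing comparisons should use a second call. For a plain matrix product like this one, `users @ items.T` already dispatches to BLAS; explicit loops under Numba pay off mainly for updates that cannot be written as a single array expression.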

The most important directories are outlined below:

├── CTMP
│   ├── common
│   ├── experimentation
│   ├── input-data
│   ├── model
│   ├── output-data
├── db-files
│   ├── original-files
│   ├── processed-files
├── pre-CTMP
├── papers
│   ├── others
│   ├── variational-inference

Short Explanation:
CTMP     -> model implementation and experimental studies.
db-files -> data fetched from the Oracle Database and the first phase of pre-processing.
pre-CTMP -> second phase of pre-processing, i.e. Vocabulary Extraction, Document Representation.
papers   -> papers on the models and on techniques used in the field of recommender systems (e.g., CTPF, CTR, Variational Inference).
 

To see the main code where the model is implemented, look at:
./CTMP/model/CTMP.py
./CTMP/model/run_model.py
./CTMP/model/Evaluation.py

Some results from experimental studies:

  • Recall & Precision graph and Sparsity Graph
  • Topics extracted from corpus
killer murder police detective one man case young find serial
band rock music life one new love singer world young
ship gold find two one captain island treasure young get
school high one friends life students college student new girl
king evil world young must one princess prince queen love
film documentary life one world new films story first interviews
husband wife married life tom one love two get marriage
...
...
...
life new love one young singer old career family york
war german world army soldiers one british men american story
love film young movie falls life man director one world
prison life man years one two young story new time
christmas one new life get time boy young find santa
president war one political film government american life man country
police drug gang one crime new cop man two life