mreco

mreco is a Movie Recommendation System that leverages algorithms based on:

User-user collaborative filtering (CF)
Item-item collaborative filtering (CF)
Dimensionality Reduction (Matrix Factorization)

Tasks

Scrape Imdb and build mongodb collection
Create web app
Implement Collaborative filtering (CF) algorithms
Implement Matrix Factorization algorithm
Deploy on heroku
Create docker container
Write tests (WIP)
Add Continuous Integration (WIP)

Getting started

Install a virtual environment on mac/linux or windows

$ pip install -r requirements.txt

Install MongoDB and start a mongod process. Use the collection dump provided in the repository. Instructions

You can also run this command to import movies from data.json file:

$ mongoimport --db imovies --collection movie --file data.json --jsonArray

Set environment variables:

$ export FLASK_ENV=development
$ export FLASK_APP=mreco

Run the app:

$ flask run

Dependencies

Web Framework - Flask
Database - MongoDB
Document-Object Mapper - mongoengine
ML and data science libraries - scikit-learn, pandas, numpy, scipy
Database Service (used in production) - mlab
Cloud Application Platform - Heroku
Frontend Framework - Bootstrap
Container Platform - Docker

Directory Structure

.
├── Dockerfile
├── Procfile
├── README.md
├── data.json
├── docker-compose.yml
├── dump
│   └── imovies
│       ├── movie.bson
│       ├── movie.metadata.json
│       ├── rating.bson
│       ├── rating.metadata.json
│       ├── user.bson
│       └── user.metadata.json
├── instance
├── m.csv
├── mreco
│   ├── __init__.py
│   ├── forms.py
│   ├── models.py
│   ├── recommender.py
│   ├── routes.py
│   ├── static
│   │   ├── logo.png
│   │   └── mr.png
│   └── templates
│       ├── auth
│       │   ├── login.html
│       │   └── register.html
│       ├── base.html
│       ├── index.html
│       └── movie.html
├── r.csv
├── r2.csv
├── requirements.txt
├── runtime.txt
└── u.csv

Approach followed

Scraped movies data from Imdb and saved it in a json file.
Imported the json file into a mongodb collection.
Build a flask webapp with user authentication, CRUD (rating movies).
mreco/recommender.py contains 3 algorithms:
- Popularity based recommender: It recommends the most popular movies, same recommendations to all the users. I have just written the algorithm but not used it in the actual app.
- Item similarity recommender (CF): It computes similarity between items and recommends items that are similar to an item that is liked by a particular user. Here, item means a movie. Suppose u is the currently logged in user. High level pseudocode of the algorithm:
```
for every item i that u has no preference for yet:
  for every item j that u has a preference for:
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average
```
- User similarity recommender (CF): It computes similarity between users and recommends items liked by a given user to the other users who are similar to the given user. Suppose u is the currently logged in user. High level pseudocode of the algorithm:
```
for every item i that u has no preference for yet:
  for every other user v that has a preference for i:
    compute a similarity s between u and v
    add v's preference for i, weighted by s, to a running average
return the top items, ranked by weighted average
```
mreco/routes.py contains all the url routes, and also a method matrix_factorization where the Dimensionality Reduction (low ranked matrix factorization) based algorithm has been implemented. I have used Singular Value Decomposition (SVD) to create a low ranked matrix.
This app currently generated csv files of ratings and users, so it isn't very efficient. However it can be improved by making lesser calls to the function generating the csv files or by directly creating pandas dataframes from the database.

Conclusion

Item-item CF works well when there are a larger number of items(movies) as compared to the number of users.
User-user CF is more suited when number of users exceeds number of items.
The above 2 algorithms aren't helpful to solve the 'cold start problem', i.e., when there are very few items rated, and there's no similarity between items/users. You might have seen in the webapp, whenever you're creating a new account and rating one or two movies, then you might get 0 recommendations using the above CF algorithms.
To solve the cold start problem, we use Matrix factorization method. Let's say we have a new user who hasn't watched enough movies yet, but we can still recommend movies to that user.
We can't generalize and consider any of these as the best algorithm, it depends on the situation and needs. If the user wants to see the most popular movies then we would use a simple popularity based recommender, which would give us the desired results.

Setting up using Docker

Create an account on https://mlab.com/ and follow the instructions provided there to import the movies dataset from your local MongoDB database (do this after importing the dataset into your own mongodb /data/db directory).
Open __init__.py and update your mlab credentials (MONGODB_HOST, MONGODB_USERNAME, MONGODB_PASSWORD).
Run docker desktop on your machine.
cd into this repository and run: docker build --tag=mreco .
docker run -p 4000:80 mreco
Visit http://localhost:4000 to see mreco in action!

Credits

Inspired by https://towardsdatascience.com/the-4-recommendation-engines-that-can-predict-your-movie-tastes-109dc4e10c52 and https://towardsdatascience.com/how-to-build-a-simple-song-recommender-296fcbc8c85

rajgoesout/reco-engine