Recommendations Engine

This is an implementation of a neighbor-based matrix decomposition recommendations engine using Yelp's mrjob package to run the program on Amazon's Elastic MapReduce.

Quickstart

Install Python packages Yelp's mrjob package and NumPy.

build_matrix.py will read in a directory, build the initial ratings matrix, and create predicted ratings that can be output to the terminal.

mr_job_recommendations.py will load:

dictionary of movie ratings - (Key: (movie_id, user_id), Value: rating)

dictionary of movies and users who have rated each movie, used to build neighborhoods - (Key: movie_id, Value: list of users who have rated that movie)

The map function places movies into neighborhoods and the reduce function produced predicted ratings through matrix decomposition. Yelp's mrjob package has great documentation and can be configured to Hadoop and Amazon EMR jobs.

Within the Python files, you can also set the following inputs to affect the accuracy of the algorithm, and the impact of these parameters are discussed in more detail below:

    K     : the number of latent features
    iterations : the maximum number of iterations toward optimization
    alpha : the learning rate (rate at which vectors in P and Q are incremented toward existing ratings)
    beta  : the regularization parameter (prevents overfitting)

File Manifest

build_matrix.py - functions that read in data files, produce neighborhoods, performs matrix decomposition, and creates predictive ratings matrix.

mr_job_recommendations.py - program that runs the matrix decomposition as a MapReduce job using Yelp's MRjob package

build_test_data.py - program that builds files in the same format as the Netflix data to produce practice datasets

netflix_local_data - Reduced datasets in the same format as the data provided by Netflix. The first line of every file is the movie_id, and each line thereafter is in the format "user_id,rating,date"

movie_ratings_dictionary_NLD, movies_and_users_who_rated_NLD - Dictionaries from the local data that cna be loaded in the mr_job_recommendations.py file to run the MapReduce job locally

The Netflix Prize

In 2006, Netflix announced an open competition for the best collaborative filtering algorithm to predict user ratings for films based solely on previous ratings. Netflix provided a dataset of 480,189 users, 17,770 movies, and 100,480,507 ratings.

How does one calculate predicted ratings for each (movie, user) pair from a very sparse seed dataset?

The Solution

Neighbor-based Matrix Decomposition

Two factor matrices are generated and the dot-product of the vectors from the factor matrices are checked against the existing ratings and tweaked until the error between the dot-products is very small. More information the Incremental Matrix Factorization technique can be found here

Once the factor matrices are built, the product results in the predicted ratings matrix.

MapReduce

Because of the size of the dataset, one can distribute computing on a Hadoop cluster. I divided the dataset into neighborhoods and used Yelp's mrjob package and Amazon's Elastic MapReduce.

Movies were divided into neighborhoods by the number of common users who had rated those movies in the mapping step. The reducing step performs the matrix decomposition on the neighborhood.