Movie Recommender System

Description

This project builds a movie recommender system from cleaned Netflix Prize data. Each line of the cleaned data has the format "userId,movieId,rating".

Guide

step 1. choose an algorithm - itemCF

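We use item-based collaborative filtering (itemCF) because there are far more users than movies, so a movie-to-movie matrix is much smaller than a user-to-user one. Movies also change infrequently, which means the matrix rarely needs to be recomputed and keeps the computation cost low. Last but not least, recommendations based on a user's own rating history are more convincing.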

step 2. describe the relationship between movies - co-occurrence matrix

We use the rating history to define the relationship between movies: if a user has rated two movies, we consider those two movies related. We then build a co-occurrence matrix that represents the relationship between every pair of movies, in the format "movieA:movieB relationship".
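The counting described above can be sketched in plain Python. This is a minimal in-memory illustration, not the project's MapReduce job: the function name build_cooccurrence and the sample lines are hypothetical, and it assumes the relationship is the number of users who rated both movies, with each movie also paired with itself.

```python
from collections import defaultdict
from itertools import product


def build_cooccurrence(lines):
    """Count, for each ordered movie pair, how many users rated both movies."""
    movies_by_user = defaultdict(set)
    for line in lines:
        user_id, movie_id, _rating = line.strip().split(",")
        movies_by_user[user_id].add(movie_id)

    cooccurrence = defaultdict(int)
    for movies in movies_by_user.values():
        # Assumption: every ordered pair is counted, including (m, m),
        # so the diagonal holds the number of users who rated m.
        for movie_a, movie_b in product(movies, repeat=2):
            cooccurrence[(movie_a, movie_b)] += 1
    return dict(cooccurrence)


# Hypothetical sample in the "userId,movieId,rating" format.
sample = ["1,10001,5.0", "1,10002,3.0", "2,10001,4.0", "2,10003,2.0"]
matrix = build_cooccurrence(sample)
for (movie_a, movie_b), count in sorted(matrix.items()):
    print(f"{movie_a}:{movie_b}\t{count}")
```

Movies that no user rated together, such as 10002 and 10003 in this sample, simply have no entry, which keeps the matrix sparse.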

Finally, we normalize the co-occurrence matrix to make the results more accurate, and transpose it for the MapReduce computation, giving the format "movieB movieA=relationship".
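The text does not say which normalization is used. As one possible reading, the sketch below divides each count by the total of its movieA row, so every row sums to 1, and then re-keys the entries by movieB. The function name normalize_and_transpose and the movie ids A, B and C are hypothetical.

```python
from collections import defaultdict


def normalize_and_transpose(cooccurrence):
    """Row-normalize the co-occurrence counts, then key the result by movieB."""
    row_totals = defaultdict(float)
    for (movie_a, _movie_b), count in cooccurrence.items():
        row_totals[movie_a] += count

    transposed = defaultdict(dict)
    for (movie_a, movie_b), count in cooccurrence.items():
        # Assumption: each entry is divided by the sum of its movieA row.
        transposed[movie_b][movie_a] = count / row_totals[movie_a]
    return dict(transposed)


# Hypothetical counts for three movies.
counts = {
    ("A", "A"): 2, ("A", "B"): 1, ("A", "C"): 1,
    ("B", "A"): 1, ("B", "B"): 1,
    ("C", "A"): 1, ("C", "C"): 1,
}
transposed = normalize_and_transpose(counts)
for movie_b, column in sorted(transposed.items()):
    for movie_a, relation in sorted(column.items()):
        print(f"{movie_b}\t{movie_a}={relation}")
```

Keying by movieB means each output group holds one column of the normalized matrix, which is the shape needed to pair it with a user's rating for movieB in the multiplication step.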

step 3. build a rating matrix group by user

The rating matrix has one row per user, in the format "userId movieA=rating,movieB=rating,movieC=rating,...".
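A minimal sketch of this grouping, assuming the input lines follow the "userId,movieId,rating" format and that a tab separates the key from the value in the output. The names group_ratings_by_user and format_row are hypothetical.

```python
from collections import defaultdict


def group_ratings_by_user(lines):
    """Collect every (movie, rating) pair under the user who gave it."""
    ratings = defaultdict(dict)
    for line in lines:
        user_id, movie_id, rating = line.strip().split(",")
        ratings[user_id][movie_id] = float(rating)
    return dict(ratings)


def format_row(user_id, movie_ratings):
    """Render one user's row as 'userId<TAB>movieA=rating,movieB=rating,...'."""
    pairs = ",".join(f"{movie}={rating}" for movie, rating in sorted(movie_ratings.items()))
    return f"{user_id}\t{pairs}"


# Hypothetical sample lines.
sample = ["1,10001,5.0", "1,10002,3.0", "2,10001,4.0"]
ratings_by_user = group_ratings_by_user(sample)
for user_id, movie_ratings in sorted(ratings_by_user.items()):
    print(format_row(user_id, movie_ratings))
```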

step 4. multiply co-occurrence matrix and rating matrix

Each partial product is written in the format "userId:movieId multiplyUnitResult".
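One way to picture the multiplication: for every movieB a user has rated, each "movieA=relationship" entry in the movieB column yields one partial product keyed by "userId:movieA". The sketch below assumes the transposed matrix and the rating matrix are already in memory as dictionaries; multiply_units and the sample values are hypothetical.

```python
def multiply_units(transposed, ratings_by_user):
    """Emit one (userId:movieA, relationship * rating) unit per matching pair."""
    units = []
    for user_id, movie_ratings in ratings_by_user.items():
        for movie_b, rating in movie_ratings.items():
            # Movies missing from the matrix contribute nothing.
            for movie_a, relation in transposed.get(movie_b, {}).items():
                units.append((f"{user_id}:{movie_a}", relation * rating))
    return units


# Hypothetical inputs: one column of the transposed matrix and one user's rating.
transposed = {"B": {"A": 0.25, "B": 0.5}}
ratings_by_user = {"u1": {"B": 4.0}}
units = multiply_units(transposed, ratings_by_user)
for key, value in units:
    print(f"{key}\t{value}")
```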

step 5. sum up and compare

We then sum the multiplication results grouped by user and movie to get a predicted rating for each movie by each user, in the format "userId:movieId predicted_rating".
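The summation amounts to a group-by on the "userId:movieId" key. A minimal sketch, with the hypothetical name sum_units and made-up partial products:

```python
from collections import defaultdict


def sum_units(units):
    """Add up all partial products that share the same userId:movieId key."""
    totals = defaultdict(float)
    for key, value in units:
        totals[key] += value
    return dict(totals)


# Hypothetical partial products from the multiplication step.
units = [("u1:A", 1.0), ("u1:A", 0.5), ("u1:B", 2.0)]
predicted = sum_units(units)
for key, rating in sorted(predicted.items()):
    print(f"{key}\t{rating}")
```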

Comparing the predicted ratings with the historical ratings reveals a problem. Take user_1's ratings as an example: the difference between user_1's ratings of movie_10001 and movie_10002 is not the same in the predicted data as in the historical data. Why, and how do we deal with it?

To be continued...

Reference:

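MapReduce exercises using the Java API

Building a movie recommender system with Hadoop

Three ways to read configuration files in MapReduce (MR), and traversing files in an HDFS directory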

tony-chenjy/MovieRecomenderSystem