The goal of this team project is to develop a recommendation algorithm for movies. The data for this project comprises 910,190 ratings (on a 1 to 5 scale) that were given by 6,040 users to 3,706 movies. Next to the ratings, the data contains some information on the users (gender, age, and profession) and information on the movies (title and year of release). We develop an algorithm that, based on this data, predicts as good as possible what rating a particular user will give to a particular movie.
We also applied a hybrid approach that builds upon CF to improve its predictability by combining information extracted from demographic data (figure below) and content (movie) data.
SGD-based factorization has been a successful method for recommendation algorithms and it was used by the winners of the actual Netix challenge. We used a hybrid approach: content-based Collaborative Filtering (item-item: between items calculated using people's ratings of the items - i.e. movies), in combination with Stochastic Gradient Descent.
We turned towards the extensive IMDb dataset which provided us with the opportunity to take factors such as the genre or the metascore of a given movie into account when building content-based methods. This resulted in ~ 15% improvement in score.
For each algorithm, we performed hyper-parameter tuning to choose the values for certain parameters that minimize the RMSE score. The most effective method found was the hybrid SGD-enhanced-content-CF.
- Luca Pantea
- Nikolaos Efthymiou