This project is part of the OpenClassrooms Data Scientist curriculum, run together with CentraleSupélec.
For this project, we used an IMDB movie database describing over 5000 movies through features such as cost, earnings and rating. The goal was, given the title of a movie, to recommend 5 other movies likely to interest the user.
Some feature engineering was needed, for example on content_rating, where redundant rating labels had to be merged. We also added two new features: revenue, defined as the difference between gross and budget, and success, defined as the ratio between revenue and budget.
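The two derived features can be sketched with pandas as follows. This is only an illustration on toy rows: the column names gross, budget and content_rating follow the description above, while the values and the rating_map dictionary are assumptions, not the mapping actually used in the project.

```python
import pandas as pd

# Toy rows standing in for the real database (values are made up).
movies = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "content_rating": ["M", "GP", "R"],
    "gross": [150.0, 20.0, 5.0],
    "budget": [100.0, 10.0, 10.0],
})

# Hypothetical mapping that merges redundant rating labels into one.
rating_map = {"M": "PG", "GP": "PG", "X": "NC-17"}
movies["content_rating"] = movies["content_rating"].replace(rating_map)

# revenue = gross - budget, success = revenue / budget
movies["revenue"] = movies["gross"] - movies["budget"]
movies["success"] = movies["revenue"] / movies["budget"]
```

A negative success value then simply means the movie earned less than it cost.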
After having cleaned the data, and before moving to the recommendation system itself, we first did some exploration to get a better understanding of our data.
Some interesting insights were:
- Although the movies in this database span 1927 to 2016, 75% of them were produced after 1999.
- Unsurprisingly, 96% of the movies are in color. Almost 80% of them were produced in the US, and as a consequence more than 95% of the movies are in English.
- The longest movie in the database is « Blood In, Blood Out », lasting over 5h30min.
- The movie with the biggest gross is « Avatar », with $760M.
- The average IMDB score is around 6.5, and the movie with the best score (9.3) is « The Shawshank Redemption ».
- The 2 movies with the biggest success ratio (revenue/budget) are « Paranormal Activity » and « The Blair Witch Project ».
To build the recommender itself, we then proceeded as follows:
- We tried several dimensionality reduction techniques on the cleaned data, both to allow proper visualisation and to improve the performance of the algorithms. PCA didn't seem well suited to our case, so we used t-SNE.
- On the reduced data, we applied a clustering algorithm. Here we used K-means++ and relied on the « silhouette coefficient » to find the appropriate number of clusters.
- Within the cluster containing the movie we want recommendations for, we looked for the most similar movies, using « cosine similarity » as the measure.
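The reduction and clustering steps can be sketched with scikit-learn as shown below. This is a minimal sketch under stated assumptions, not the project's actual code: the feature matrix is synthetic, and the perplexity, the range of candidate cluster counts and the random seeds are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the cleaned, numeric movie features:
# three groups of 30 points in 8 dimensions.
centers = rng.normal(scale=10.0, size=(3, 8))
features = np.vstack([c + rng.normal(size=(30, 8)) for c in centers])

# Reduce to 2 dimensions with t-SNE (perplexity must stay below n_samples).
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(features)

# Fit K-means with k-means++ initialisation for several cluster counts
# and record the silhouette coefficient of each clustering.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = model.fit_predict(embedding)
    scores[k] = silhouette_score(embedding, labels)

# Keep the cluster count with the highest silhouette coefficient.
best_k = max(scores, key=scores.get)
```

The silhouette coefficient lies between -1 and 1, and higher values indicate tighter, better separated clusters, which is why the maximum is used to pick the number of clusters.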
In the end, our recommendation system works as follows:
- Find the cluster label of the movie we want recommendations for.
- Take the complete cluster corresponding to this label.
- Within this cluster, compute the cosine similarity between that movie and every other movie.
- Take the 5 movies with the highest similarity scores.
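The four steps above can be sketched in a few lines of NumPy. The function name recommend, its arguments and the toy data are hypothetical. The sketch also assumes that every feature vector is non-zero and that similarity is computed on whatever feature matrix is passed in, since the text does not say whether the original or the reduced features were used.

```python
import numpy as np

def recommend(title, titles, features, labels, n=5):
    """Return up to n titles from the query's cluster, ranked by cosine similarity."""
    idx = titles.index(title)
    # Steps 1-2: collect the other movies sharing the query's cluster label.
    members = [i for i in range(len(titles)) if labels[i] == labels[idx] and i != idx]
    if not members:
        return []
    # Step 3: cosine similarity between the query and each cluster member.
    query = features[idx]
    candidates = features[members]
    sims = candidates @ query / (
        np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    )
    # Step 4: keep the n most similar movies.
    order = np.argsort(sims)[::-1][:n]
    return [titles[members[i]] for i in order]

# Toy example with made-up titles, features and cluster labels.
titles = ["A", "B", "C", "D"]
features = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [1.0, 1.0]])
labels = np.array([0, 0, 1, 0])
print(recommend("A", titles, features, labels, n=2))  # ['B', 'D']
```

Restricting the search to one cluster keeps the similarity computation small, at the cost of never recommending a movie that landed in a neighbouring cluster.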
This approach is known as « content-based filtering »: recommendations depend only on the features of the movies, so every user gets the same recommendations for a given movie. It contrasts with « collaborative filtering », which also requires information about the users.