A content-based recommender system that suggests movies similar to the one the user enters.
When we click on a movie to watch on Netflix, it recommends many other movies related to that title. We wondered how this works and which factors drive it. That curiosity motivated our group to build this project for the Data Science Fundamental class.
The objective of this project is to practice using Jupyter Notebook. Recommendations are based on the content of the movie you enter or select. The main fields considered are movie_id, title, and tags, where the tags combine each movie's genres, keywords, overview, cast, and crew. Movie details are fetched from TMDB.[1]
Data processing and model training are done in Jupyter Notebook.
There are two datasets:
- `tmdb_5000_movies.csv`: 20 features, such as 'budget', 'genres', 'homepage', 'id', 'keywords', ...
- `tmdb_5000_credits.csv`: 4 features: 'movie_id', 'title', 'cast', 'crew'.
The datasets can be found on Kaggle.[2]
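As a rough sketch of how the two datasets might be combined, the snippet below joins them with pandas. Joining on `title`, the choice of columns to keep, and the helper name `merge_datasets` are assumptions made for illustration; the toy rows stand in for the real CSV files.

```python
import pandas as pd


def merge_datasets(movies: pd.DataFrame, credits: pd.DataFrame) -> pd.DataFrame:
    """Join movie metadata with cast/crew and keep the columns used for tags.

    Assumption: both tables share a 'title' column that can serve as the key.
    """
    merged = movies.merge(credits, on="title")
    keep = ["movie_id", "title", "overview", "genres", "keywords", "cast", "crew"]
    return merged[keep]


# Toy stand-ins for the real files. With the Kaggle data this would be:
#   movies = pd.read_csv("tmdb_5000_movies.csv")
#   credits = pd.read_csv("tmdb_5000_credits.csv")
movies = pd.DataFrame({
    "id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "overview": ["A hero saves the city.", "A crew explores space."],
    "genres": ['[{"id": 28, "name": "Action"}]', '[{"id": 878, "name": "Science Fiction"}]'],
    "keywords": ["[]", "[]"],
    "budget": [100, 200],
})
credits = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "cast": ["[]", "[]"],
    "crew": ["[]", "[]"],
})

merged = merge_datasets(movies, credits)
print(merged.shape)
```

Columns not needed for the tags (such as 'budget' or 'homepage') are dropped early so later steps work on a smaller table.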
- NumPy, Pandas, nltk and scikit-learn for data cleaning and model building
- Matplotlib (pyplot) for visualizing features
- ast (literal_eval) for evaluating a string that contains a Python list and converting it to a list object
- Pickle (dump, load) for converting a Python object into a byte stream so it can be stored in a file or database, kept across sessions, or sent over the network
- Streamlit[3] (st) for web app design
- GitHub for version control and Heroku for web app deployment and hosting
- VS Code and Jupyter Notebook as development environments
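To illustrate the roles of `ast.literal_eval` and `pickle` listed above, here is a minimal sketch. The helper name `extract_names` and the sample string are hypothetical; the sample only mimics the list-of-dicts format stored in columns such as 'genres'.

```python
import ast
import io
import pickle


def extract_names(raw, limit=None):
    """Parse a stringified list of dicts and return the 'name' values."""
    items = ast.literal_eval(raw)  # safely turns the string into a Python list
    names = [item["name"] for item in items]
    return names if limit is None else names[:limit]


# Sample value in the same shape as a 'genres' cell (illustrative only)
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
genres = extract_names(raw_genres)

# Round-trip through pickle; an in-memory buffer stands in for a file on disk
buffer = io.BytesIO()
pickle.dump(genres, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

print(genres, restored == genres)
```

In the project itself the pickled objects would more likely be written to files with `open(..., "wb")`, so the web app can load them without repeating the notebook steps.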
- Modules: stem.porter
- Classes: PorterStemmer
PorterStemmer removes the more common morphological and inflectional endings from English words, so that variants of a word share one stem:
- acts, acted, acting -> act
- aid, aided, aids -> aid
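The stemming step can be sketched as follows, assuming nltk is installed. The helper name `stem_text` is hypothetical; it simply applies `PorterStemmer.stem` to each whitespace-separated word of a tag string.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


def stem_text(text: str) -> str:
    """Replace every word in a tag string with its Porter stem."""
    return " ".join(stemmer.stem(word) for word in text.split())


print(stem_text("aid aided aids"))
```

Stemming before vectorization means that word variants are counted as the same term, which keeps the vocabulary smaller.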
- Modules: feature_extraction.text, metrics.pairwise
- Classes and functions: CountVectorizer (with stop-word removal), cosine_similarity
- Word vectorization: converting text into vectors (lists of numbers)
- Using Bag of Words (a simplified text representation for NLP), which represents each document by its word frequencies
- movie_1 : action movie (tags)
- movie_2 : SciFi movie (tags)
- movie_3 : adventure movie (tags)
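A small Bag of Words sketch for the three example movies above, assuming scikit-learn is installed. The `stop_words="english"` setting is an assumption based on the "stop word" note; the project's actual parameters are not shown in this document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One tag string per movie, mirroring movie_1, movie_2 and movie_3
tags = ["action movie", "scifi movie", "adventure movie"]

# Count how often each vocabulary word occurs in each tag string
vectorizer = CountVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(tags).toarray()

# Vocabulary words ordered by their column index in the count matrix
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocabulary)
print(vectors)
```

Each row is one movie and each column one word, so all three movies share the "movie" column and differ only in their genre word.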
Cosine similarity is a metric that measures how similar documents are irrespective of their size. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space. This is an advantage because two similar documents can be far apart in Euclidean distance (due to a difference in document length) and still point in nearly the same direction. The smaller the angle, the higher the cosine similarity.
Distance and similarity are inversely related:
Cosine similarity = 1 - cosine distance
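- if angle is close to 0° (for example 5°), cosine ≈ 1 => the two movies are considered almost the same
- if angle = 90°, cosine = 0 => the two movies share no tag words and are considered unrelated
- if angle = 180°, cosine = -1 => the vectors point in opposite directions; this cannot occur with word-count vectors, which have no negative entries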
More about Cosine Similarity:
Understanding the Math behind Cosine Similarity[4]
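To make the relationship between angle, cosine similarity, and document size concrete, here is a minimal pure-Python sketch. The function names and the toy vectors are illustrative only; the project itself uses scikit-learn's `cosine_similarity`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))  # guard against rounding
    return math.degrees(math.acos(cos))


short_doc = [1, 1, 0]    # a short document
long_doc = [10, 10, 0]   # same word mix, ten times longer
unrelated = [0, 0, 3]    # shares no words with the other two

print(math.dist(short_doc, long_doc))           # large Euclidean distance
print(cosine_similarity(short_doc, long_doc))   # yet the direction is the same
print(angle_degrees(short_doc, unrelated))      # orthogonal vectors
print(1 - cosine_similarity(short_doc, unrelated))  # cosine distance
```

The short and long documents are far apart by Euclidean distance but have a cosine similarity of about 1, while the unrelated document sits at a right angle with similarity 0.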
How does the system decide which movies are most similar to the one the user enters (or selects)? This is where similarity scores come in.
A similarity score is a numerical value between zero and one that indicates how alike two items are. It is obtained by comparing the text details (tags) of the two items, here using cosine similarity.
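One way to turn similarity scores into recommendations is sketched below. The function name `recommend`, its parameters, and the toy similarity matrix are assumptions for illustration; in practice the matrix would come from applying `cosine_similarity` to the tag vectors.

```python
def recommend(title, titles, similarity, top_n=5):
    """Return the top_n titles most similar to `title`.

    `similarity[i][j]` is the similarity score between titles[i] and titles[j].
    """
    idx = titles.index(title)
    scores = list(enumerate(similarity[idx]))
    # Highest score first; the selected movie itself is skipped
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in ranked if i != idx][:top_n]


# Toy data: three movies and a symmetric similarity matrix
titles = ["Movie A", "Movie B", "Movie C"]
similarity = [
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]

print(recommend("Movie A", titles, similarity, top_n=2))
```

Every movie has a score of 1 with itself, so the selected title is excluded before the top results are taken.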
Check out the live demo: https://movies-recommender-vod6.herokuapp.com/
Create an account at The Movie Database.[5] Once your account is created, click the API link in the left-hand sidebar of your account settings and fill in the details to apply for an API key. If you are asked for a website URL and don't have one, enter "NA". After your request is approved, the API key appears under the API section of the sidebar.
- Clone or download this repository to your local machine.
- Install all the libraries listed in `requirements.txt` with the command `pip install -r requirements.txt`.
- Get your API key from https://www.themoviedb.org/. (See the section above on how to get the API key.)
- Replace YOUR_API_KEY in `TMDB_API_KEY` with your own key.
- Open a terminal/command prompt in the project directory and run `app.py` with the command `streamlit run app.py`.
- Hurray! That's it.
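As a hedged illustration of where the API key is used, the sketch below builds TMDB request URLs without making any network call. Reading the key from an environment variable and the helper names `movie_details_url` and `poster_url` are assumptions; how `app.py` actually uses the key is not shown in this document.

```python
import os

# Assumption: fall back to the placeholder when no environment variable is set
TMDB_API_KEY = os.environ.get("TMDB_API_KEY", "YOUR_API_KEY")


def movie_details_url(movie_id, api_key=TMDB_API_KEY):
    """URL of the TMDB v3 movie-details endpoint for one movie."""
    return f"https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}"


def poster_url(poster_path, size="w500"):
    """Full image URL for a 'poster_path' value returned by TMDB."""
    return f"https://image.tmdb.org/t/p/{size}{poster_path}"


print(movie_details_url(550, api_key="abc"))
print(poster_url("/example.jpg"))
```

Keeping the key in an environment variable rather than in the source file avoids committing it to the repository.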
- Suggest related movies to users so they spend less time browsing for the next movie in the same genres.
- Serve as a search engine for us and as a basis for further research using different matrix factorizations
- Future research: add movie ratings and analyze users' comments to improve the accuracy of our recommendations.
- Abstracted each movie as a data point and used cosine similarity to generate recommendations for each movie
- Presented and visualized the results as a web page
- The large, heavy dataset caused run-time complexity issues
[1] TMDB
[2] movie dataset
[3] Streamlit