- Dan Hales, danhalesprogramming@gmail.com
- Nick Pardue, nickpardue@gmail.com
We decided to create a movie recommendation website to compete with Netflix. To do that, we needed a way to recommend movies to users who visit our site, based on movies they have previously seen and rated.
We were given data sourced from MovieLens. Using this data, we sought to find out the following:
- Do popular movies tend to be more highly rated?
- How do users tend to rate movies?
- On average, are there any genres that are rated higher than others?
- Are there any genres that tend to get more ratings?
Our Sourced Data
We analyzed a dataset of 100k movie ratings from 610 users, across just under 10k movies. Each rating record contained the following relevant variables (a loading sketch follows the list):
- movieId
- A numerical identifier of the movie being rated.
- title
- The title of the movie being rated.
- userId
- The user who submitted the rating.
- rating
- The rating a user gave to a movie, on a half-star scale from 0.5 to 5.
- genres
- The genre(s) of the movie being rated.
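
For readers following along, the raw tables can be loaded and joined with pandas. This is a minimal sketch, assuming the CSVs extracted from data.zip sit under a `data/` directory in the standard MovieLens ml-latest-small layout; adjust the paths if yours differ.

```python
import pandas as pd

# ratings.csv holds (userId, movieId, rating, timestamp); movies.csv holds
# (movieId, title, genres). Joining on movieId yields one row per rating
# carrying all of the variables listed above.
ratings = pd.read_csv("data/ratings.csv")
movies = pd.read_csv("data/movies.csv")
df = ratings.merge(movies, on="movieId")

print(df[["userId", "movieId", "title", "rating", "genres"]].head())
```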
- We found that, yes, movies with a higher number of ratings tend to have a higher average score, typically between 3 and 4.
- We found that users tend to rate movies around 3.5 to 4 on average. Additionally, the vast majority of users rate fewer than 100 movies, and as users rate more movies, their average rating tends to converge toward that overall average.
- Looking at the average rating of each genre, we found no indication that any genre tends to be rated significantly higher than the others.
- After ruling out a genre's average rating as a factor in deciding which movies to seek out for expanding our library, we looked at the total number of ratings for each genre, assuming this gives an accurate picture of what kinds of movies users actually watch. Based on the chart, we will continue to seek out movies that fall under a mixture of the Drama, Comedy, Action, Adventure, and Thriller genres (a sketch of these aggregations follows this list).
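
The three findings above can be reproduced with a few pandas aggregations. This is again a minimal sketch under the same `data/` layout assumption; the full analysis lives in index.ipynb.

```python
import pandas as pd

ratings = pd.read_csv("data/ratings.csv")
movies = pd.read_csv("data/movies.csv")

# 1. Popularity vs. rating: bucket movies by how many ratings they received,
#    then compare average scores across buckets.
per_movie = ratings.groupby("movieId")["rating"].agg(n_ratings="count", avg_rating="mean")
per_movie["bucket"] = pd.qcut(per_movie["n_ratings"], q=4, duplicates="drop")
print(per_movie.groupby("bucket", observed=True)["avg_rating"].mean())

# 2. User behavior: rating counts and average score per user.
per_user = ratings.groupby("userId")["rating"].agg(n_ratings="count", avg_rating="mean")
print(per_user["avg_rating"].describe())       # means cluster around 3.5-4
print((per_user["n_ratings"] < 100).mean())    # share of users with <100 ratings

# 3. Genres: MovieLens stores genres pipe-separated ("Action|Adventure|Sci-Fi"),
#    so explode each rating into one row per genre before aggregating.
df = ratings.merge(movies, on="movieId")
by_genre = (df.assign(genre=df["genres"].str.split("|"))
              .explode("genre")
              .groupby("genre")["rating"]
              .agg(n_ratings="count", avg_rating="mean")
              .sort_values("n_ratings", ascending=False))
print(by_genre.head(10))
```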
- We split our original dataset into training and test sets (80:20) and fit our model on the training set. On the training data, our predicted rating for a movie a user has not yet seen fell within 0.70 points of the user's true rating; on the test data, predictions fell within 0.89 points. We consider this an acceptable amount of error, since the difference between a 4-star and a 5-star rating is not that large (see the evaluation sketch after this list).
- Here's an example of one user's ratings from the test data. The predictions for Toy Story and Toy Story 3 fell outside the 0.89-point error range, but the other predictions were spot on.
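
The notebook holds the actual model code; as a hedged stand-in, here is how the same workflow (80:20 split, error measured on held-out ratings) looks with an SVD model from the scikit-surprise library. The choice of SVD, the random seeds, and the example (user, movie) pair are our assumptions for illustration, not necessarily what index.ipynb does.

```python
import pandas as pd
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split

# Load the ratings and tell surprise that scores run from 0.5 to 5.
ratings = pd.read_csv("data/ratings.csv")
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)

# 80:20 split, fit on the training portion, score on the held-out portion.
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
model = SVD(random_state=42)
model.fit(trainset)

# Error of the held-out predictions; we observed errors on the order of
# ~0.89 stars on test data.
accuracy.rmse(model.test(testset))

# Predicting a single unseen (user, movie) pair, e.g. userId 1, movieId 50.
print(model.predict(uid=1, iid=50).est)
```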
- We would like to gather more users and their ratings so that we can update our model and provide more accurate predictions.
- Implement user-supplied keywords for the creation of new features.
- Collect implicit data: how many times a user watches a movie, how much of each movie they watch, binge-watching behavior, etc.
Download the MovieLens 100k dataset, and then follow along with our Executive Notebook to produce a copy of our results.
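
If you'd rather fetch the data programmatically, the snippet below downloads and extracts the archive. The URL is GroupLens's ml-latest-small link at the time of writing; note that it unpacks to `data/ml-latest-small/`, so adjust the paths in the earlier sketches to match.

```python
import io
import urllib.request
import zipfile

# GroupLens's ml-latest-small archive (100k ratings); unpacks into
# data/ml-latest-small/, which contains ratings.csv and movies.csv.
URL = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
with urllib.request.urlopen(URL) as resp:
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall("data")
```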
├── Images
│ ├── genre_num.png
│ ├── genre_rating.png
│ ├── nextflik.jpeg
│ ├── pop_movies.png
│ ├── results.png
│ ├── rmse.png
│ └── user_ratings.png
│
├── Movie Recommender.pdf
│
├── README.md
│
├── data.zip
│
└── index.ipynb
Sources can be found in our presentation's speaker notes.