by Peter Hontaru
This document is a report from the final course project for the Linear regression course, as part of the Duke University Statistics with R course in partnership with Coursera.
This project is based on a fictitious scenario where I’ve been hired as a data scientist at Paramount Pictures. The data presents numerous variables on movies such as audience/critic ratings, number of votes, runtime, genre, etc. Paramount Pictures is looking to gather insights into determining the acclaim of a film and other novel patterns or ideas.
- our linear model was only able to capture 29% of the variability in our data
- the model significantly over-predicts for low IMDB ratings and under-predicts for high IMDB ratings
- while we were able to identify significant predictors, the design of the study does not allow us to infer causation
- critics and audiences tend to differ in their movie taste, particularly in categories such as Comedy and Action & Adventure
- some accolades are better predictors of IMDB ratings than others
- Friday might be the best day to release a movie in terms of audience access, but the movies are not the most popular (as judged by total IMDB votes or IMDB rating)
- there is an exponential relationship between the number of votes a movie receives and the IMDB rating, where highly rated movies are much more popular than normally expected
- further optimisation is possible if we have a specific goal, such as making a movie for either critics or audiences
- similarly, popularity could also be determined by total number of votes (quantity) rather than rating (quality)
- relatively low sample sizes before 1990; our data is not representative of each year included in the dataset (but it is representative overall)
- we could use more data on factors we would normally have access to before the release of the film: budget, actor/director social media influence (which might affect ratings or popularity), etc
- other types of models could be explored, particularly exponential ones
- some techniques were used specifically because of the course context; more robust methods are available, such as:
  - adjusted R² as the selection criterion, as it is known to be more reliable than an arbitrary p-value threshold
  - a train/test split of our dataset
  - automated backward/forward model selection (e.g. through caret), since a manual approach is susceptible to human error
- no data pre-processing
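
As a rough illustration of two of the more robust methods listed above (adjusted R² and a train/test split), here is a minimal sketch in plain Python. It is not the code used in the project, which was written in R. The data is synthetic, and all function names, the single predictor, and the 80/20 split are assumptions made for the example.

```python
import random


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(ys, preds):
    """Share of the variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Penalise R² for the number of predictors p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and hold out test_fraction of them."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Synthetic (x, y) pairs standing in for a predictor and a rating.
rng = random.Random(0)
rows = [(x, 6.0 + 0.02 * x + rng.gauss(0, 1)) for x in range(100)]

train, test = train_test_split(rows)
intercept, slope = fit_simple_ols([x for x, _ in train], [y for _, y in train])


def predict(x):
    return intercept + slope * x


train_r2 = r_squared([y for _, y in train], [predict(x) for x, _ in train])
test_r2 = r_squared([y for _, y in test], [predict(x) for x, _ in test])
train_adj_r2 = adjusted_r_squared(train_r2, n=len(train), p=1)

print(f"train R2={train_r2:.3f}  adjusted={train_adj_r2:.3f}  test R2={test_r2:.3f}")
```

Comparing the training R² with the R² on the held-out rows gives a rough check on overfitting, while adjusted R² only rewards an extra predictor if it improves the fit by more than chance would suggest.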
The data set comprises 651 randomly sampled movies produced and released before 2016.
Source:
- Rotten Tomatoes: https://www.rottentomatoes.com/
- IMDB API: https://www.imdb.com/
Full project available:
- RECOMMENDED: at the following link, in HTML format
- in the Movie rating.md file of this repo (however, I recommend previewing it at the link above, since it was originally designed as an HTML document)