Capstone: TMDB US Box Office Prediction

(Artist: Serbinka From Shutterstock)

Problem Statement

According to the Domestic Movie Theatrical Market Summary 1995 to 2020, about 1.24 billion movie tickets were sold in the United States in 2019. That volume suggests a great deal of money flows into the movie industry. So how do we, as movie lovers, estimate this cash flow? We focus on a movie's revenue. In other words, can we predict a movie's United States box office revenue?

We use a dataset of over 7,000 past films from The Movie Database, published on the Kaggle website with selected features. In particular, those features are cast, crew, plot keywords, budget, posters, release dates, languages, production companies, countries, and others. We then select the appropriate features to train various regression models with custom hyperparameters. Afterwards, we use our models' $R^2$ scores to determine the best model to answer this question. Our goal is to help production companies decide how they should invest to earn a return.


Outside Research

On The Movie Database website, movie seekers can find a lot of information about movies and television shows. To see details about a particular movie, we can either click on the movie's poster or search for it in the website's database. On a movie's page, we can see the title, poster, user score, overview, full cast & crew, status, original language, runtime, budget, genres, keywords, content score, and revenue if the movie has already been released. We can also read reviews and start discussions about the movie. In other words, it is a movie lover's dream. With that in mind, most of these factors can help us determine how much money a movie will make once it reaches the United States box office.

According to "Movies: What determines the success of a movie (by box office revenue)?", many factors can determine the box office success of a film, such as the popularity of the film's content, the current popularity of the film's genre, the current popularity of the film's stars, the strength of the film's marketing campaign, and the strength of the film's distribution as well as its release schedule. As a result, popularity and budget seem like key factors that can influence a movie's revenue.

In addition, "Movies: What determines the success of a movie (by box office revenue)?" states that factors such as weather, holidays, and distracting news events can limit the film's audience during the critical opening weekend. In that case, the release datetime seems like a key factor that can also influence a movie's revenue.

"Movies: What determines the success of a movie (by box office revenue)?" also lists awareness of the film, for example when it is adapted from a popular book or news story, as a factor. For this reason, genres seem like a key factor that can also influence a movie's revenue.

Lastly, according to "Study explores what really makes a movie successful", a key determinant of box office success is the number of screens on which a movie is released. Thus, movie production companies need to budget for advertising as well. In this case, does the production company have enough money to distribute the movie to the locations where movies are shown, and enough for advertising? For this reason, production companies seem like a key factor that can also influence a movie's revenue.

In sum, we will use this research to help us understand our data throughout this data science process.


Executive Summary

We began by importing the training dataset from the Kaggle website. We renamed the dataset to movies so that it clearly reflected the objective in our problem statement. The movies dataset had twenty-four columns, including our target variable, the revenue of a particular movie.

We began our data cleaning by dropping all the movies that were not made in the United States, because we wanted to shape our dataset to answer our problem statement. Then we dropped the features that were not relevant to our problem statement. We also dealt with missing values and fixed the datetime feature. Some of our categorical features were stored in a dictionary format, so we extracted the values we wanted from them as features in our dataset. Afterwards, we one-hot encoded the categorical data. However, creating these binary features would have added too many columns to our dataset, so we consolidated some of the binary features into grouped features. We created eight new features as a result. Lastly, we saved our clean movies dataset for future use.
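The dictionary-format extraction and one-hot encoding described above could be sketched as follows. This is a minimal illustration, not the notebook's actual code: the toy rows, the string layout of the `genres` column, and the helper `extract_names` are all assumptions.

```python
import ast

import pandas as pd

# Toy rows; the stringified list-of-dicts layout is an assumption.
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [
        "[{'id': 12, 'name': 'Adventure'}, {'id': 28, 'name': 'Action'}]",
        "[{'id': 35, 'name': 'Comedy'}]",
        None,  # missing value
    ],
})


def extract_names(cell):
    """Return the 'name' values from a stringified list of dictionaries."""
    if not isinstance(cell, str):
        return []  # treat missing values as "no genres"
    return [item["name"] for item in ast.literal_eval(cell)]


genre_lists = movies["genres"].apply(extract_names)

# One binary column per genre: 1 if the movie has that genre, else 0.
genre_dummies = (
    pd.get_dummies(genre_lists.explode(), prefix="genre", dtype=int)
    .groupby(level=0)
    .max()
)
genre_dummies.columns = genre_dummies.columns.str.lower()
movies = movies.join(genre_dummies)
```

A movie with several genres gets a 1 in each matching column, and a movie with a missing value gets 0 everywhere.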

Next, we did some exploratory data analysis. First, we checked the summary statistics. Then we investigated the target variable, revenue, by looking at its distribution. Based on the summary statistics, we investigated selected univariate distributions. We limited these to the original numerical features because their distributions were easier to read than those of our "created" numerical features. Then, using a set of selected features, we explored the correlations between those features and the target variable. We selected only a subset of features because not all of the binary features are suited to visual comparison. To explore the other binary features visually, we looked at their frequency counts. Afterwards, we investigated the datetime features to see if there were trends over time. Lastly, we identified the outliers in our dataset.
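As a rough illustration of the correlation step, the snippet below ranks features by their correlation with the target. The numbers are made up for the example and are not taken from the movies dataset.

```python
import pandas as pd

# Made-up values, only to show the mechanics.
movies = pd.DataFrame({
    "budget": [10, 50, 100, 150, 200],
    "popularity": [2.0, 9.5, 7.0, 15.0, 22.0],
    "runtime": [120, 88, 95, 130, 101],
    "revenue": [15, 120, 180, 400, 650],
})

# Pearson correlation of every feature with revenue, strongest first.
correlations = (
    movies.corr()["revenue"]
    .drop("revenue")
    .sort_values(ascending=False)
)
print(correlations)
```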

Before we modeled, we needed to do some preparation. We began by creating our feature matrix X and target y. Then we performed a train-test split. Lastly, we determined the baseline scores using the DummyRegressor model.
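A minimal sketch of this preparation, assuming scikit-learn and a synthetic stand-in for the cleaned data (the features, the target, and the `random_state` values are illustrative assumptions):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the cleaned movies data (assumption).
X = rng.uniform(0, 1, size=(200, 2))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The baseline always predicts the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
train_r2 = baseline.score(X_train, y_train)  # R^2 on the training set
test_r2 = baseline.score(X_test, y_test)     # R^2 on the testing set
```

A mean-only baseline scores an $R^2$ of 0 on the training set and at most 0 on the testing set, which is the pattern in the Baseline row of the results table.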

Finally, we were able to model. We fit various regression models: Linear Regression with default hyperparameters, and Ridge Regression, Lasso Regression, ElasticNet Regression, BaggingRegressor, and RandomForestRegressor with tuned hyperparameters. Then we supplemented the models with visualizations. We graphed the predicted values with respect to the actual values and explored some of the models' coefficients.
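One common way to tune hyperparameters is a cross-validated grid search. The sketch below assumes scikit-learn's `GridSearchCV`; the parameter grids and the synthetic data are illustrative assumptions, not the values used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic regression data (assumption).
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical grids; a real search would likely cover more values.
searches = {
    "ridge": GridSearchCV(
        make_pipeline(StandardScaler(), Ridge()),
        param_grid={"ridge__alpha": [0.1, 1.0, 10.0]},
        cv=5,
    ),
    "random_forest": GridSearchCV(
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [50], "max_depth": [3, None]},
        cv=5,
    ),
}

results = {}
for name, search in searches.items():
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "train_r2": search.score(X_train, y_train),
        "test_r2": search.score(X_test, y_test),
    }
```

Comparing `train_r2` with `test_r2` for each entry gives a quick read on overfitting.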

In the end, we focused on the $R^2$ scores, RMSE metrics, and the bias-variance tradeoff from each model to determine which model was the best to answer our problem statement.
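For reference, both metrics can be computed by hand. This is a small sketch using the standard definitions; the function names `r_squared` and `rmse` are chosen for this example only.

```python
import math


def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Square root of the mean squared error, in the target's own units."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]
print(r_squared(actual, predicted))
print(rmse(actual, predicted))
```

$R^2$ is unitless and compares the model with a mean-only prediction, while RMSE is expressed in the units of the target (here, revenue).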


Data Dictionary

We had an original data dictionary, but we added new features to our dataset, so we created a new dictionary.

| Feature | Datatype | Description |
|---|---|---|
| budget | int64 | The amount of money spent to make and advertise the movie. |
| original_title | object | The movie's original name. |
| popularity | float64 | The likeability score for the movie. This is out of 100 percent. |
| release_date | datetime64[ns] | The date the movie was released to the public. |
| runtime | float64 | The full length of the movie in minutes. |
| title | object | The current name of the movie. |
| month_release_date | int64 | The month the movie was released to the public. |
| year_release_date | int64 | The year the movie was released to the public. |
| disney_production_company | int64 | Based on the list of Disney-owned production companies. |
| twenty_century_fox_production_company | int64 | Based on the list of Twentieth Century Fox-owned companies. |
| warner_bros_production_company | int64 | Based on the list of Warner Bros.-owned companies. |
| nbcuniversal_production_company | int64 | Based on the list of NBCUniversal-owned companies. |
| sony_pictures_production_company | int64 | Based on the list of Sony Pictures-owned companies. |
| paramount_pictures_production_company | int64 | Based on the list of Paramount Pictures-owned companies. |
| top_twenty_influential_actors | int64 | Based on the list of the top twenty influential actors in the industry. |
| top_twenty_keywords | int64 | Based on the list of the top twenty keywords individuals use to identify movies. |
| genres_dummies | int64 | Dummy features of the genres feature. Genres are categories or classifications of movies. |
| status_dummies | int64 | Dummy features of the status feature. Status tells us if the movie was released, rumored, or in post-production. |
| crew_departments_dummies | int64 | Dummy features of the crew_departments feature. Crew departments are the various departments that worked on a movie. |
| revenue | int64 | The amount of money the movie has made after being released to the public. |

Conclusions and Recommendations

| Model | Training $R^2$ Score | Testing $R^2$ Score | Training RMSE | Testing RMSE |
|---|---|---|---|---|
| Baseline | 0.000 | -0.002 | 13889519 | 169751452 |
| Linear | 0.617 | 0.604 | 85935229 | 106672431 |
| Ridge | 0.617 | 0.591 | 86004081 | 108497311 |
| Lasso | 0.617 | 0.600 | 85935229 | 107705049 |
| ElasticNet | 0.616 | 0.600 | 86056471 | 108778885 |
| BaggingRegressor | 0.944 | 0.700 | 32899455 | 94545011 |
| RandomForestRegressor | 0.800 | 0.671 | 62142378 | 97225529 |

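All of the regression models surpassed the baseline scores. Weighing the testing $R^2$ scores against the bias-variance tradeoff, we chose the RandomForestRegressor as the best model. The BaggingRegressor had a higher testing $R^2$ score (0.700 versus 0.671), but its gap between training and testing scores was much wider (0.944 versus 0.700, compared with 0.800 versus 0.671). According to the testing $R^2$ score, the RandomForestRegressor handled unseen data well. However, the model was still overfit because it had low bias and high variance.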
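One quick way to read overfitting from the table above is the gap between training and testing $R^2$ scores. The snippet below recomputes that gap from the reported scores; the dictionary layout is just a convenience for this example.

```python
# (training R^2, testing R^2) copied from the results table.
scores = {
    "Linear": (0.617, 0.604),
    "Ridge": (0.617, 0.591),
    "Lasso": (0.617, 0.600),
    "ElasticNet": (0.616, 0.600),
    "BaggingRegressor": (0.944, 0.700),
    "RandomForestRegressor": (0.800, 0.671),
}

# A larger gap means the model fits the training data much better
# than the testing data, which is a sign of higher variance.
gaps = {name: round(train - test, 3) for name, (train, test) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<22} gap = {gap:.3f}")
```

The linear models show small gaps, while the two ensemble models trade higher testing scores for larger gaps.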

Despite the overfitting, this model can be used to predict the revenue of a US movie, given that we know the selected features of that particular movie. The top two features that production companies should focus on to get a return on their investments are budget and popularity, because these were consistently the two features most correlated with the target variable. Production companies should also consider investing in genre_adventure, because movies in that genre made up our outliers with extremely high revenue and it had the third-highest coefficient in our regularized models.

Yet the RandomForestRegressor still had its limitations. It is harder to interpret than the linear models, and it cannot predict beyond the range of the training data. It also overfit our dataset because it did not handle the noise well. In other words, additional noisy features could have hindered our model's results.

We could improve our model's $R^2$ scores if we further tuned our hyperparameters and eliminated more outliers.
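The range limitation can be seen in a small experiment. This sketch assumes scikit-learn and uses a made-up linear relationship: the forest is trained on inputs from 0 to 99 and then asked about an input far outside that range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up data: the target is exactly twice the feature.
X_train = np.arange(0, 100, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()  # largest training target is 198

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# The true value at x = 500 would be 1000, but each tree can only
# return averages of training targets, so the prediction stays
# at or below the largest target seen during training.
prediction = forest.predict(np.array([[500.0]]))[0]
print(prediction, y_train.max())
```

For box office data, this means a forest would not predict a revenue higher than the largest revenues it was trained on.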

In the end, we still have lingering questions we need to ask:

  • Can we use a Convolutional Neural Network on the image data as part of a transfer learning process to engineer additional features in our prediction model?
  • Can we feature engineer our features and target variable to optimize our prediction model's results, e.g. by taking the logarithm?
  • Will this model still be valid 5 years from now, when consumer preferences and trends in movies change? We have already seen different trends throughout the decades in our EDA.
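The logarithm idea in the second question could be explored with scikit-learn's `TransformedTargetRegressor`, which fits the model on a transformed target and converts predictions back to the original scale. The sketch below is one possible starting point; the synthetic budget and revenue values are assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic, right-skewed data standing in for budget and revenue.
budget = rng.uniform(1e6, 2e8, size=300)
revenue = budget ** 1.1 * rng.lognormal(mean=0.0, sigma=0.5, size=300)

# Log-transform the feature by hand and let the wrapper handle the target.
X = np.log1p(budget).reshape(-1, 1)
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # applied to the target before fitting
    inverse_func=np.expm1,  # applied to predictions afterwards
)
model.fit(X, revenue)

# Predictions come back in the original revenue units.
predicted_revenue = model.predict(X)
```

Because the inverse transform is applied automatically, the $R^2$ and RMSE values stay comparable with those of models trained on the raw target.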

Sources