(Artist: Serbinka From Shutterstock)
- Problem Statement
- Outside Research
- Executive Summary
- Data Dictionary
- Conclusions and Recommendations
- Sources
According to the Domestic Movie Theatrical Market Summary 1995 to 2020, about 1.24 billion movie tickets were sold in the United States in 2019. Given that figure, we can assume that a great deal of money flows into the movie industry. So how do we, as movie lovers, estimate that cash flow? We will focus on a movie's revenue. In other words, can we predict a movie's United States box office revenue?
We use a dataset of over 7,000 past films from The Movie Database, hosted on the Kaggle website, that includes selected features. In particular, those features are cast, crew, plot keywords, budget, posters, release dates, languages, production companies, countries, and others. We then select the appropriate features to train various regression models with custom hyperparameters. Afterwards, we will use our models'
On The Movie Database website, movie seekers can find a lot of information about movies and television shows. To see the details of a particular movie, we can either click on the movie's poster or search for it in the website's database. On a movie's page, we can see the title, poster, user score, overview, full cast & crew, status, original language, runtime, budget, genres, keywords, content score, and, if the movie has already been released, revenue. We can also read reviews and start discussions about a particular movie. In other words, it is a movie lover's dream. With that in mind, most of these factors can help us determine how much money a movie will make once it reaches the United States box office.
To emphasize, according to "Movies: What determines the success of a movie (by box office revenue)?", many factors can determine the box office success of a film, such as the popularity of the film's content, the current popularity of the film's genre, the current popularity of the film's stars, the strength of the film's marketing campaign, and the strength of the film's distribution as well as its release schedule. As a result, popularity and budgeting seem like key factors that can influence a movie's revenue.
In addition, "Movies: What determines the success of a movie (by box office revenue)?" states that factors such as weather, holidays, and distracting news events can limit the film's audience during the critical opening weekend. In that case, datetime seems like a key factor that can also influence a movie's revenue.
Also, "Movies: What determines the success of a movie (by box office revenue)?" states that awareness of a film can come from its being adapted from a popular book or news story. For this reason, genres seem like a key factor that can also influence a movie's revenue.
Lastly, according to "Study explores what really makes a movie successful", a key determinant of box office success is the number of screens on which a movie is released. Thus, movie production companies need to budget for their advertising as well. In this case, does the production company have enough money to sell the movie to the locations where movies are shown, and enough left for advertising? For this reason, production companies seem like a key factor that can also influence a movie's revenue.
In sum, we will use this research to help us understand our data fully throughout this data science process.
We began by importing the training dataset from the Kaggle website. We renamed the dataset to movies because we wanted to clearly identify our objective from our problem statement. The movies dataset had twenty-four feature columns, including our target variable, the revenue of a particular movie.
We began our data cleaning by dropping all the movies that were not made in the United States because we wanted to shape our dataset to answer our problem statement. Then we dropped the features that were not relevant to our problem statement. We also dealt with missing values and fixed the datetime feature. We extracted data from specific features as well, because some of our categorical features were stored in a dictionary format and we wanted some of those values as features in our dataset. Afterwards, we one-hot encoded the categorical data. However, creating these binary features would have added too many features to our dataset, so we consolidated some of them. We created eight new features as a result. Lastly, we saved our clean movies dataset for future use.
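The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy rows, the stringified list-of-dictionaries layout, the `extract_values` helper, and the abbreviated `DISNEY_COMPANIES` set are all assumptions made for the example.

```python
import ast

import pandas as pd

# Toy rows imitating dictionary-formatted columns (assumed layout).
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "production_countries": [
        "[{'iso_3166_1': 'US', 'name': 'United States of America'}]",
        "[{'iso_3166_1': 'FR', 'name': 'France'}]",
        "[{'iso_3166_1': 'US', 'name': 'United States of America'}]",
    ],
    "production_companies": [
        "[{'name': 'Walt Disney Pictures', 'id': 2}, {'name': 'Pixar', 'id': 3}]",
        "[{'name': 'Gaumont', 'id': 9}]",
        None,
    ],
    "release_date": ["2015-02-20", "2004-08-06", "2014-10-10"],
})


def extract_values(cell, key="name"):
    """Return the values stored under `key` in a stringified list of dicts."""
    if not isinstance(cell, str):
        return []  # treat missing values as an empty list
    return [entry[key] for entry in ast.literal_eval(cell)]


# Keep only the movies produced in the United States.
countries = movies["production_countries"].map(lambda c: extract_values(c, "iso_3166_1"))
us_movies = movies[countries.map(lambda codes: "US" in codes)].copy()

# Collapse many company names into one binary feature instead of one dummy each.
DISNEY_COMPANIES = {"Walt Disney Pictures", "Pixar", "Marvel Studios"}  # hypothetical, abbreviated
companies = us_movies["production_companies"].map(extract_values)
us_movies["disney_production_company"] = companies.map(
    lambda names: int(any(name in DISNEY_COMPANIES for name in names))
)

# Fix the datetime feature and derive month and year columns from it.
us_movies["release_date"] = pd.to_datetime(us_movies["release_date"])
us_movies["month_release_date"] = us_movies["release_date"].dt.month
us_movies["year_release_date"] = us_movies["release_date"].dt.year

print(us_movies[["title", "disney_production_company", "month_release_date"]])
```

Grouping company names into a single indicator keeps the feature count manageable compared with one dummy column per company.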
Next, we were able to do some exploratory data analysis. First, we checked the summary statistics. Then we investigated the target variable revenue to see how it behaved by looking at its distribution. Following up on our summary statistics analysis, we investigated selected univariate distributions. These were limited to the original numerical features because their distributions were easier to read than those of our "created" numerical features. Then, using a set of selected features, we explored correlations between those features and the target variable. We selected only a subset of features because not all of the binary features are suited to visual inspection. To explore the other binary features visually, we looked at their frequency counts. Afterwards, we investigated the datetime features to see whether there were trends over time. Lastly, we determined the outliers in our dataset.
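As a rough illustration of two of these steps, the sketch below ranks features by their correlation with the target and flags outliers. The toy numbers are invented, and the 1.5 × IQR rule is an assumption chosen for the example; the text does not say which outlier criterion was used.

```python
import pandas as pd

# Invented toy data; the last movie has an unusually high revenue.
movies = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50, 60, 70, 80],
    "runtime": [95, 120, 88, 101, 130, 99, 110, 105],
    "revenue": [25, 45, 70, 85, 110, 130, 150, 900],
})

# Pearson correlation of each feature with the target, strongest first.
correlations = (
    movies.corr()["revenue"].drop("revenue").sort_values(ascending=False)
)
print(correlations)

# Flag revenue outliers with the 1.5 * IQR rule (assumed criterion).
q1, q3 = movies["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (movies["revenue"] < q1 - 1.5 * iqr) | (movies["revenue"] > q3 + 1.5 * iqr)
outliers = movies[is_outlier]
print(outliers)
```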
Before we modeled, we needed to do some preparation. We began by creating our X features and y. Then we performed a train-test split. Lastly, we determined the baseline scores, using the DummyRegressor model to obtain these values.
Finally, we were able to model. We fit various regression models: Linear Regression with default hyperparameters, and Ridge Regression, Lasso Regression, ElasticNet Regression, BaggingRegressor, and RandomForestRegressor with tuned hyperparameters. Then we supplemented the models with visualizations. We graphed the predicted values with respect to the actual values and explored some of the models' coefficients.
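The preparation steps can be sketched as below. The synthetic data, the column choice, and the `random_state` values are assumptions for illustration only; the DummyRegressor here predicts the training mean, which is scikit-learn's default strategy.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned movies dataset.
rng = np.random.default_rng(42)
n = 200
movies = pd.DataFrame({
    "budget": rng.uniform(1e6, 2e8, n),
    "popularity": rng.uniform(0, 50, n),
})
movies["revenue"] = (
    2.5 * movies["budget"] + 1e6 * movies["popularity"] + rng.normal(0, 2e7, n)
)

# Create the X features and the y target, then split.
X = movies.drop(columns="revenue")
y = movies["revenue"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: always predict the mean revenue of the training set.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
train_r2 = baseline.score(X_train, y_train)
test_r2 = baseline.score(X_test, y_test)
train_rmse = np.sqrt(mean_squared_error(y_train, baseline.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))

print(f"R2   train={train_r2:.3f} test={test_r2:.3f}")
print(f"RMSE train={train_rmse:.0f} test={test_rmse:.0f}")
```

A mean predictor scores an R² of exactly zero on the data it was fit on and zero or slightly below on unseen data, so any useful model should beat both numbers.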
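One way to tune such models is a cross-validated grid search, sketched below for Ridge and RandomForestRegressor. The synthetic data, the scaling step, the parameter grids, and the use of `GridSearchCV` itself are assumptions; the text only says the hyperparameters were tuned.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the movies data.
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=4, noise=10.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Ridge: scale the features first so the penalty treats them equally.
ridge_search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0, 100.0]},
    cv=5,
)
ridge_search.fit(X_train, y_train)

# Random forest: a small illustrative grid over tree depth.
forest_search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid={"max_depth": [3, None]},
    cv=3,
)
forest_search.fit(X_train, y_train)

# Compare training and testing R^2 to spot overfitting.
for name, search in [("Ridge", ridge_search), ("RandomForest", forest_search)]:
    print(
        name,
        search.best_params_,
        round(search.score(X_train, y_train), 3),
        round(search.score(X_test, y_test), 3),
    )
```

Reporting both training and testing scores side by side is what makes a gap such as 0.944 versus 0.700 visible as overfitting.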
In the end, we focused on the
We started with an original data dictionary, but we added new features to our dataset. Let's create a new dictionary.
Feature | Datatype | Description |
---|---|---|
budget | int64 | The amount of money spent to make and advertise the movie. |
original_title | object | The movie's original name. |
popularity | float64 | The likeability score for the movie. This is out of 100 percent. |
release_date | datetime64[ns] | The date the movie was released to the public. |
runtime | float64 | The movie's full runtime in minutes. |
title | object | The movie's current name. |
month_release_date | int64 | The month the movie was released to the public. |
year_release_date | int64 | The year the movie was released to the public. |
disney_production_company | int64 | The list of Disney-owned production companies. |
twenty_century_fox_production_company | int64 | The list of Twenty Century Fox-owned companies. |
warner_bros_production_company | int64 | The list of Warner Bros-owned companies. |
nbcuniversal_production_company | int64 | The list of NBCUniversal-owned companies. |
sony_pictures_production_company | int64 | The list of Sony Pictures-owned companies. |
paramount_pictures_production_company | int64 | The list of Paramount Pictures-owned companies. |
top_twenty_influential_actors | int64 | The list of the top twenty influential actors in the industry. |
top_twenty_keywords | int64 | The list of the top twenty keywords individuals use to identify movies. |
genres_dummies | int64 | Dummy features of the genres feature. Genres are categories or classifications of movies. |
status_dummies | int64 | Dummy features of the status feature. Status tells us if the movie was released, rumored, or in post-production. |
crew_departments_dummies | int64 | Dummy features of the crew_departments feature. Crew departments are the departments that worked on a movie. |
revenue | int64 | The amount of money the movie has made after being released to the public. |
Model | Training R² | Testing R² | RMSE Training | RMSE Testing |
---|---|---|---|---|
Baseline | 0.000 | -0.002 | 13889519 | 169751452 |
Linear | 0.617 | 0.604 | 85935229 | 106672431 |
Ridge | 0.617 | 0.591 | 86004081 | 108497311 |
Lasso | 0.617 | 0.600 | 85935229 | 107705049 |
ElasticNet | 0.616 | 0.600 | 86056471 | 108778885 |
BaggingRegressor | 0.944 | 0.700 | 32899455 | 94545011 |
RandomForestRegressor | 0.800 | 0.671 | 62142378 | 97225529 |
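For readers unfamiliar with the two metrics, the sketch below computes them by hand on invented numbers. It assumes the Training and Testing columns are R² scores, which is consistent with the baseline row (a mean predictor scores 0.000 on its own training data). The function names are hypothetical helpers, not part of the project.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the share of variance explained."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse_manual(y_true, y_pred):
    """Root mean squared error, in the same units as the target (dollars)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# Invented revenues (actual vs. predicted) for four movies.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

r2 = r2_manual(y_true, y_pred)
rmse = rmse_manual(y_true, y_pred)
print(r2, rmse)
```

A higher R² and a lower RMSE are better; comparing each metric between the training and testing columns shows how well a model generalizes.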
All of the regression models surpassed the baseline scores. We selected the RandomForestRegressor as the best model. According to the testing
Despite the overfitting, this model can be used to predict the revenue of a US movie, provided we know the selected features of that particular movie. The top two features that production companies should focus on to get a return on their investments are budget and popularity, because these were consistently the two features most correlated with the target variable. Production companies should also consider investing in genre_adventure, because that genre accounted for our outliers with extremely high revenue and had the third highest coefficient in our regularized models.
Yet the RandomForestRegressor still had its limitations. It is hard to interpret fully, and it cannot predict beyond the range of the training data. It also overfit our dataset because it is sensitive to noise. In other words, additional noisy features could hinder our model's results.
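The range limitation can be seen in a small experiment. The sketch below uses invented data with a perfectly linear relationship; it is an illustration of the general behavior of tree ensembles, not a reproduction of the project's model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Training data: y = 3x for x in 0..99, so the largest target seen is 297.
X_train = np.arange(0, 100, dtype=float).reshape(-1, 1)
y_train = 3.0 * X_train.ravel()

# A new point far outside the training range.
X_new = np.array([[200.0]])

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

forest_pred = forest.predict(X_new)[0]
linear_pred = linear.predict(X_new)[0]

# The forest averages training targets stored in its leaves, so its
# prediction cannot exceed the largest target it saw during training.
print("forest:", forest_pred, "linear:", linear_pred)
```

For box office revenue this means a forest trained on past films cannot predict a record-breaking gross larger than anything in its training set.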
We can improve our model's
In the end, we still have lingering questions we need to ask:
- Can we use a Convolutional Neural Network on the image data as part of a transfer learning process to engineer additional features in our prediction model?
- Can we engineer our features and target variable to optimize our model's predictions? For example, by taking the logarithm.
- Will this model still be valid 5 years from now, when consumer preferences and trends in movies change? We have already seen different trends across the decades in our EDA.
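The logarithm idea from the second question could be explored with scikit-learn's `TransformedTargetRegressor`, as in the hedged sketch below. The synthetic log-linear data and the choice of `np.log1p` / `np.expm1` are assumptions made for the example; nothing here was run on the movies dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: revenue is roughly log-linear in budget (assumed shape).
rng = np.random.default_rng(0)
log_budget = rng.uniform(13, 19, 300)
X = log_budget.reshape(-1, 1)
y = np.exp(0.9 * log_budget + 1.0 + rng.normal(0, 0.5, 300))

# Fit on log1p(revenue); predictions are mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)

preds = model.predict(X)
slope = np.ravel(model.regressor_.coef_)[0]
print("slope on the log scale:", slope)
print("smallest prediction:", preds.min())
```

Working on the log scale compresses the long right tail of revenue and keeps predictions positive once they are transformed back.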
- Domestic Movie Theatrical Market Summary 1995 to 2020
- TMDB Box Office Prediction
- The Movie Database
- Movies: What determines the success of a movie (by box office revenue)?
- Study explores what really makes a movie successful
- Google Web Interface and Search Language Codes
- IMDB website
- Boom or Bust? Factors that Influence Box Office Revenue
- EVERY COMPANY DISNEY OWNS: A MAP OF DISNEY'S WORLDWIDE ASSETS
- List of assets owned by 21st Century Fox
- List of assets owned by WarnerMedia
- List of assets owned by NBCUniversal
- Sony Pictures Entertainment Motion Picture Group
- Paramount Pictures
- Movie Keywords From The Numbers
- Here Are The 20 Richest Actors In The World & Their Net Worth
- Fourth Pirates Of The Caribbean Is Most Expensive Movie Ever With Costs Of 410 Million Dollars
- Why Wonder Woman Is The Best DCEU Movie So Far
- Average movie length
- Best Business Decisions Made by Actors
- Advice from Robert Downey Jr
- Quantitative analysis of the evolution of novelty in cinema through crowdsourced keywords
- 10 Reasons Why Everyone Has Seen a Superhero Movie
- Popular Sci-Fi Films: What Makes Them So Great?
- Hollywood’s Obsession with Blockbusters
- Genre trends in global film production
- Who earns more: a director or an actor of a movie?
- Rethinking the Seasonal Strategy
- Dump months
- How Are Movie Release Dates Chosen?