In this project we build a predictive model to estimate the revenue generated by a movie from its details.
https://www.kaggle.com/datasets/akshaypawar7/millions-of-movies
The movie industry is a dynamic and complex environment where various factors influence a movie's success, particularly its revenue. This project focuses on developing a robust "Predictive Model for Movie Revenue Estimation and Decision Support." In an era of high uncertainty and financial risk in the film production business, understanding the determinants of movie revenue is paramount for informed decision-making, investment strategies, and marketing planning.
Movies are a unique combination of artistic expression and commercial enterprise. To that end, this research investigates a multitude of features that contribute to the financial performance of a movie. These include genres, original language, overview, popularity, production companies, release date, budget, runtime, status, tagline, vote average, vote count, credits, keywords, poster path, backdrop path, and recommendations. Each of these factors is examined to understand its individual, and their collective, influence on movie revenue.
By conducting a comprehensive analysis of these features, this research aims to identify patterns, relationships, and correlations that can offer valuable insights into revenue prediction. The findings of this study are expected to have substantial implications for movie production companies, investors, and stakeholders. They can leverage the predictive model developed in this project to make informed decisions regarding their movie investments, marketing strategies, and financial planning, ultimately contributing to the success and profitability of their ventures in the film industry.
Which features are most highly correlated with revenue? Since this is a regression problem, what is the best metric to evaluate our model? Which regression model gives the best predictions? What are the root mean squared error, R², and other metrics for predicting the revenue of a movie? Using the statistical tests and models learned, we will try to predict the revenue of a movie from the independent features.
Reference guide for the columns in the dataset:
- title: Movie or show name
- genres: Content categories or themes
- original_language: Language of the original content
- popularity: Measure of audience interest
- release_date: Date of public availability
- budget: Cost of production
- revenue: Income generated from content
- runtime: Duration of the content
- status: Current release status
- vote_average: Average audience ratings
- vote_count: Number of audience votes
- trailer_views: Number of trailer views
- trailer_likes: Number of trailer likes

This glossary lets us identify which features are significant.
From the glossary, the following columns are subjective and hard to interpret or use in modelling, so we remove them:

- id
- overview
- production_companies
- tagline
- credits
- keywords
- poster_path
- backdrop_path
- recommendations

It is not a good idea to use these features, as they are just metadata and say nothing about how well a movie will perform. This is how our dataset looks now.
Inferences:
- Some movies have a budget of 0, which is not possible (remove rows with budget 0).
- Revenue has negative values, which is again not possible (remove rows with revenue 0 or negative).
- Some movies have a runtime of 0, which is not possible (remove rows with runtime 0).
- Check for outliers and influential points in all the columns.
- Runtime had 0.004% missing values; these rows were dropped, as the number of missing values is small.

The cleaning steps are sketched below.
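A minimal sketch of these cleaning steps, assuming the data has been read into a data frame called `movies` with column names matching the glossary (both the data frame name and the exact column names are assumptions):

```r
# Drop the subjective / metadata columns identified in the glossary
drop_cols <- c("id", "overview", "production_companies", "tagline", "credits",
               "keywords", "poster_path", "backdrop_path", "recommendations")
movies <- movies[, !(names(movies) %in% drop_cols)]

# Drop the small fraction of rows with missing runtime
movies <- movies[!is.na(movies$runtime), ]

# Remove impossible values: zero budget, non-positive revenue, zero runtime
movies <- subset(movies, budget > 0 & revenue > 0 & runtime > 0)
```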
Inferences:
Trailer Likes and Trailer Views have a linear relationship with the output/dependent variable "Revenue".
Inferences:
- Trailer Likes and Trailer Views have a high correlation with "Revenue".
- Vote Count and Budget have a moderate correlation with "Revenue".
- Popularity, Runtime, and Average Vote have a poor correlation with "Revenue".

One way to reproduce this ranking is sketched below.
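A sketch of the correlation check, assuming the numeric columns have been renamed to the names used in the modelling sections below (an assumption on naming):

```r
num_cols <- c("Revenue", "Budget", "Popularity", "Runtime", "Avg_Vote",
              "Vote_Count", "Trailer_Views", "Trailer_Likes")

# Pearson correlation of every numeric feature with Revenue
cor_with_revenue <- cor(movies[, num_cols])[, "Revenue"]
sort(cor_with_revenue, decreasing = TRUE)
```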
The features selected into the final model by forward feature selection are Trailer_Likes, Trailer_Views, Vote_Count, Budget, Runtime, Avg_Vote, and Popularity, but the intercept is not significant, so we will rebuild the model with different combinations of features.
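The forward selection itself can be sketched with `step()`, starting from the intercept-only model (the data frame name is an assumption):

```r
# Intercept-only model and the full candidate formula
null_model <- lm(Revenue ~ 1, data = movies)
full_formula <- Revenue ~ Popularity + Budget + Runtime + Avg_Vote +
  Vote_Count + Trailer_Views + Trailer_Likes

# Forward selection: add one feature at a time while AIC improves
forward_model <- step(null_model, scope = full_formula, direction = "forward")
summary(forward_model)
```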
Rows 1, 96, 443, and 15996 have a strong influence on the overall model and are therefore removed from the dataset.
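Influential observations can be flagged with Cook's distance; a minimal sketch for the four-feature model, using the common 4/n cutoff (the cutoff and data frame name are assumptions, and the exact rows flagged may differ from those listed above):

```r
model <- lm(Revenue ~ Trailer_Likes + Trailer_Views + Vote_Count + Budget,
            data = movies)

# Flag observations whose Cook's distance exceeds 4/n, then refit without them
cooks_d <- cooks.distance(model)
influential <- which(cooks_d > 4 / nrow(movies))
movies_clean <- movies[-influential, ]

model <- lm(Revenue ~ Trailer_Likes + Trailer_Views + Vote_Count + Budget,
            data = movies_clean)
summary(model)
```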
In the resulting model, all the input features are significant and the Multiple R-squared value is high. The p-value for the model is also significant, indicating a significant overall model. The residual standard error is around 3.9 million, which seems reasonable: we cannot predict the revenue of a movie exactly, as it depends on many other factors.
Coefficients:
Intercept: The intercept is estimated at 601,900. It cannot be interpreted directly as "every movie makes at least 601,900 dollars", because one of the features is budget: with a budget of 0, a movie cannot be expected to earn 601,900 dollars.
Trailer_Likes: The variable "Trailer_Likes" has an estimated coefficient of 2.8, meaning every additional like on the movie's trailer increases the estimated revenue by 2.8 dollars.
Trailer_Views: For the variable “Trailer_Views” the estimate is 0.259. This indicates that for every additional view of the movie trailer, the estimated revenue increases by $0.259.
Vote_count: The estimated coefficient for “Vote_count” is 265.7. This implies that for every additional vote, the estimated revenue increases by $265.7.
Budget: The variable “Budget” has an estimated coefficient of 0.003576. This suggests that for every additional dollar in the movie budget, the estimated revenue increases by $0.003576.
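Putting the reported coefficients together, the fitted equation can be applied to a hypothetical movie; the input values below are made up purely for illustration:

```r
# Predicted revenue assembled from the reported coefficients
predict_revenue <- function(trailer_likes, trailer_views, vote_count, budget) {
  601900 + 2.8 * trailer_likes + 0.259 * trailer_views +
    265.7 * vote_count + 0.003576 * budget
}

# Hypothetical movie: 1M trailer likes, 20M trailer views, 5,000 votes, $50M budget
predict_revenue(1e6, 2e7, 5e3, 5e7)   # ~ $10.1M estimated revenue
```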
Multiple R-squared: The proportion of the variance in the dependent variable (Revenue) that is predictable from the independent variables (Trailer_Views, Trailer_Likes, Vote_Count, Budget). In our case, it’s 94.9%.
F-statistic: Tests the overall significance of the model. In our case, it’s 2.415e+04 with a very low p-value, suggesting the model is significant.
The residual standard error has also decreased after the removal of the influential points.
Check for Multicollinearity
All VIF values are below 10, indicating no multicollinearity between the features.
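The check can be reproduced with variance inflation factors from the `car` package, assuming `model` is the fitted linear model above:

```r
library(car)

# Variance inflation factors; values above 10 are commonly taken to signal
# problematic multicollinearity
vif(model)
```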
Check for Linear Relationship with the output variable
All features in our model appear to have a linear relationship with the output variable.
Check for Normality of errors, Equal Variances of residuals and Independence of errors
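These assumptions can be checked with the standard residual diagnostics; a minimal sketch using base R plots and the `lmtest` package (the exact plots and tests used in this project are an assumption):

```r
library(lmtest)

# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(model)

# Breusch-Pagan test for non-constant variance and Durbin-Watson test for
# autocorrelation of the errors
bptest(model)
dwtest(model)
```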
Now let’s try ridge and lasso regression to see how it will be different.
First, lambda was set to 1. The results of ridge regression and lasso regression are as follows:
Ridge Regression (lambda = 1): Coefficients: Intercept = 7.93e+04, Popularity = 1326.414, Budget = 0.006, Runtime = 6035.994, Avg_Vote = 8.819e+04, Vote_Count = 523.23, Trailer_Views = 0.36, Trailer_Likes = 2.426. RMSE: 4.969e+06. R-squared: 0.908.
Lasso Regression (lambda = 1): Coefficients: Intercept = 7.93e+04, Popularity = 1326.408, Budget = 0.006, Runtime = 6035.975, Avg_Vote = 8.819e+04, Vote_Count = 523.23, Trailer_Views = 0.36, Trailer_Likes = 2.426. RMSE: 4.969e+06. R-squared: 0.908.
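These fixed-lambda fits can be sketched with the `glmnet` package, where alpha = 0 gives ridge and alpha = 1 gives lasso; the design-matrix construction and variable names below are assumptions:

```r
library(glmnet)

x <- as.matrix(movies_clean[, c("Popularity", "Budget", "Runtime", "Avg_Vote",
                                "Vote_Count", "Trailer_Views", "Trailer_Likes")])
y <- movies_clean$Revenue

# Ridge (alpha = 0) and lasso (alpha = 1) at a fixed lambda of 1
ridge_fit <- glmnet(x, y, alpha = 0, lambda = 1)
lasso_fit <- glmnet(x, y, alpha = 1, lambda = 1)

coef(ridge_fit)
coef(lasso_fit)
```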
Then, lambda was assigned a series of candidate values, and cross-validation was used to determine the optimal lambda value.
Ridge Regression: best lambda = 0.1. Coefficients: Intercept = 7.963e+04, Popularity = 1326.651, Budget = 0.006, Runtime = 6037.017, Avg_Vote = 8.812e+04, Vote_Count = 523.451, Trailer_Views = 0.36, Trailer_Likes = 2.426. RMSE: 4.969e+06. R-squared: 0.908.
Lasso Regression: best lambda = 3.941e+05. Coefficients: Intercept = 1.512e+06, Popularity = 0, Budget = 0.002, Runtime = 0, Avg_Vote = 0, Vote_Count = 481.064, Trailer_Views = 0.334, Trailer_Likes = 2.423. RMSE: 5.017e+06. R-squared: 0.906.
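The cross-validated search can be sketched with `cv.glmnet`; the lambda grid below is illustrative, not the exact one used:

```r
# Illustrative lambda grid spanning several orders of magnitude
lambda_grid <- 10^seq(-1, 6, length.out = 100)

cv_ridge <- cv.glmnet(x, y, alpha = 0, lambda = lambda_grid)
cv_lasso <- cv.glmnet(x, y, alpha = 1, lambda = lambda_grid)

cv_ridge$lambda.min                 # best lambda for ridge
cv_lasso$lambda.min                 # best lambda for lasso
coef(cv_lasso, s = "lambda.min")    # lasso coefficients at the best lambda
```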
Finally, let’s use different value ranges to determine the optimal lambda values of ridge regression and lasso regression respectively, since their optimal lambda values are so different.
Ridge Regression: best lambda = 9.908. Coefficients: Intercept = 7.962e+04, Popularity = 1326.65, Budget = 0.006, Runtime = 6037.013, Avg_Vote = 8.812e+04, Vote_Count = 523.448, Trailer_Views = 0.36, Trailer_Likes = 2.426. RMSE: 4.969e+06. R-squared: 0.908.
Lasso Regression: best lambda = 4.014e+05. Coefficients: Intercept = 1.518e+06, Popularity = 0, Budget = 0.002, Runtime = 0, Avg_Vote = 0, Vote_Count = 479.69, Trailer_Views = 0.333, Trailer_Likes = 2.423. RMSE: 5.019e+06. R-squared: 0.906.
We can see that lasso regression does shrink the coefficients of some features to exactly zero, and the features it retains match the result of the stepwise forward feature selection.
We start by performing a 4:1 train-test split of the dataset and then build the decision tree regression model.
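A minimal sketch of the 4:1 split and the initial regression tree using `rpart` (the seed and the data frame name are assumptions):

```r
library(rpart)

set.seed(42)
train_idx <- sample(seq_len(nrow(movies_clean)), size = 0.8 * nrow(movies_clean))
train <- movies_clean[train_idx, ]
test  <- movies_clean[-train_idx, ]

# Regression tree on all remaining features
tree_full <- rpart(Revenue ~ ., data = train, method = "anova")

# RMSE on the training data
pred <- predict(tree_full, train)
sqrt(mean((train$Revenue - pred)^2))
```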
The regression tree was built using the specified formula with all available features, and "Trailer_Likes" emerges as the primary splitting variable. The tree's complexity is controlled by the complexity parameter, and the cross-validated error is used to assess the model's performance. The reported RMSE of 6.419e+06 gives an estimate of the average prediction error on the training data. The CP values suggest that further pruning beyond a certain point (determined by the complexity parameter) may not significantly improve model performance.
Model1 was created using the significant factors. Its analysis reveals valuable insights into the factors influencing revenue in the examined dataset. The constructed regression tree, using the variables Budget, Vote_Count, Trailer_Views, and Trailer_Likes, prioritizes Trailer_Likes as the primary determinant of revenue. The tree's structure, with a depth of 6 nodes, provides a detailed segmentation of the dataset, emphasizing the significance of Trailer_Likes in predicting revenue variations.
Node analysis showcases the mean revenue and mean squared error (MSE) at each node, offering a granular understanding of the model's predictions. The root mean squared error (RMSE) of 6.419e+06 provides a measure of the overall model accuracy.
Variable importance ranking underscores the dominance of Trailer_Likes, followed by Trailer_Views, Vote_Count, and Budget. This suggests that while all variables contribute to revenue prediction, Trailer_Likes plays a pivotal role.
The model interpretation underscores the actionable insight that increasing Trailer_Likes is crucial for revenue enhancement. Lower Trailer_Likes correspond to diminished predicted revenue, emphasizing the potential impact of marketing and promotional efforts on revenue generation.
The regression tree analysis was conducted on the “Revenue” variable using the predictors “Budget,” “Vote_Count,” “Trailer_Views,” and “Trailer_Likes” with a maximum depth limited to 1. The key findings are as follows:
The root node error for the model is 3e+14, based on 8528 observations. The tree structure indicates that only the variable “Trailer_Likes” is utilized in the construction of the tree. The primary split occurs at a value of 6120000 for “Trailer_Likes.”
The resulting tree consists of two terminal nodes (Node 2 and Node 3). Node 2 represents observations with lower Trailer_Likes values, yielding a mean revenue of 6.26e+06 and a relatively lower MSE. On the other hand, Node 3, with higher Trailer_Likes values, has a mean revenue of 3.74e+07 and a higher MSE.
The overall Root Mean Squared Error (RMSE) of the model is 1.242e+07.
The variable importance ranking suggests that “Trailer_Likes” is the most influential predictor, followed by “Trailer_Views,” “Vote_Count,” and “Budget.”
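The depth-limited tree above can be sketched by passing `maxdepth = 1` through `rpart.control` (parameter names follow the rpart package; other settings are left at their defaults):

```r
# Single-split regression tree ("stump") on the four selected predictors
stump <- rpart(Revenue ~ Budget + Vote_Count + Trailer_Views + Trailer_Likes,
               data = train, method = "anova",
               control = rpart.control(maxdepth = 1))

stump                        # shows the single split on Trailer_Likes
stump$variable.importance    # variable importance ranking
```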
A similar tree was built with max_depth = 2, which also gave poor results.
The random forest regression analysis was conducted with the formula Revenue ~ Budget + Vote_Count + Trailer_Views + Trailer_Likes, utilizing 500 trees in the forest. The model’s type is set to regression, and it tried one variable at each split. The mean of squared residuals, a measure of prediction error, is 3.07e+13, indicating the average squared difference between predicted and actual values. Additionally, the random forest explains approximately 90% of the variance in the Revenue variable, showcasing its substantial predictive power.
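A sketch of this fit with the `randomForest` package; ntree = 500 and one variable tried at each split (mtry = 1) match the description above, while the seed and other settings are assumptions:

```r
library(randomForest)

set.seed(42)
rf_model <- randomForest(Revenue ~ Budget + Vote_Count + Trailer_Views + Trailer_Likes,
                         data = train, ntree = 500, mtry = 1)

rf_model   # prints the mean of squared residuals and % Var explained

# RMSE on the held-out test set
sqrt(mean((test$Revenue - predict(rf_model, test))^2))
```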
The Root Mean Squared Error (RMSE) for the random forest model is 5.48e+06, which represents the typical difference between the predicted and actual revenue values. A lower RMSE indicates a more accurate model; in this case, the relatively low RMSE suggests that the random forest model fits the training data well.
The utilization of a random forest, which combines predictions from multiple decision trees, often results in a robust and accurate predictive model. The % Var explained value of 90 suggests that the model effectively captures the underlying patterns in the data, demonstrating a high level of explanatory capability. This performance makes the random forest a promising tool for predicting Revenue based on the provided predictor variables. Overall, the random forest regression model appears to be a strong and reliable approach for predicting Revenue, providing valuable insights for decision-making in scenarios involving budget, vote count, trailer views, and trailer likes.
The project aimed to address the complex and dynamic nature of the movie industry by developing a regression model to estimate movie revenue. The data analysis and model building process involved exploratory data analysis, data cleaning, and feature selection.
The project addressed SMART questions, identified influential points, and conducted rigorous statistical tests. After feature selection and model building, the best R^2 value achieved was 0.92 with the features Trailer Views, Trailer Likes, Vote Count, and Budget.
The linear regression model, after thorough feature selection, highlighted the significance of Trailer_Likes, Trailer_Views, Vote_Count, and Budget in predicting movie revenue. The model achieved a Multiple R-squared value of 94.9%, indicating its strong explanatory power.
The results of ridge and lasso regression were slightly worse than linear regression, but the results of lasso regression verified the correctness of stepwise forward feature selection.
The project successfully navigated the complexities of movie revenue estimation, leveraging advanced statistical models to provide actionable insights for stakeholders in the movie production business.
The combination of linear regression, decision tree regression, and random forest regression offered a comprehensive understanding of the factors influencing movie revenue and paved the way for informed decision support in the industry, with an average RMSE of about 3.5 million dollars. Overall, the linear regression model performs better than the tree-based models.
The model could be further improved using boosting (ensemble) methods to achieve better performance, as sketched below.
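As a possible next step, a gradient boosting model could be sketched with the `gbm` package; all hyperparameter values here are illustrative assumptions, not tuned settings:

```r
library(gbm)

set.seed(42)
boost_model <- gbm(Revenue ~ Budget + Vote_Count + Trailer_Views + Trailer_Likes,
                   data = train, distribution = "gaussian",
                   n.trees = 1000, interaction.depth = 3, shrinkage = 0.05,
                   cv.folds = 5)

# Pick the number of trees that minimizes the cross-validated error
best_iter <- gbm.perf(boost_model, method = "cv")

# Test-set RMSE at the selected number of trees
sqrt(mean((test$Revenue - predict(boost_model, test, n.trees = best_iter))^2))
```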