/imdbMoviesDSAI

How to Maximise Movie Success


How to Maximise Movie Success - Project Overview

Contributors: tripleH group

The code is in:

  1. imdbFullAnalysis

  2. answeringInterestingQuestions

source: bespeaking.com

Our project is based on the IMDB 5000 dataset found on Kaggle.

Content Section

Introduction

Have you ever wondered why some movies are more successful than others? If you're a movie director, you've come to the right place! If you are not a movie director, of course, you can still read on to find out more!

Problem Statement

Identify which features contribute to the success of a movie.

Motivation

Give directors a better estimate of how to maximise the success rate of their movie.

Step 1: Looking at the Dataset

| variable | dtype | variable | dtype |
|---|---|---|---|
| color | object | actor_3_name | object |
| director_name | object | facenumber_in_poster | float64 |
| num_critic_for_reviews | float64 | plot_keywords | object |
| duration | float64 | movie_imdb_link | object |
| director_facebook_likes | float64 | num_user_for_reviews | float64 |
| actor_3_facebook_likes | float64 | language | object |
| actor_2_name | object | country | object |
| actor_1_facebook_likes | float64 | content_rating | object |
| gross | float64 | budget | float64 |
| genres | object | title_year | float64 |
| actor_1_name | object | actor_2_facebook_likes | float64 |
| movie_title | object | imdb_score | float64 |
| num_voted_users | int64 | aspect_ratio | float64 |
| cast_total_facebook_likes | int64 | movie_facebook_likes | int64 |

There are a total of 28 variables.

Our hypotheses :

  • Duration will not affect IMDB scores
  • Variables related to popularity will have positive correlation with IMDB score
  • Budget will affect IMDB score

Step 2: Data Extraction and Data Cleaning

Train:Test:Validation

We split our dataset in an 80:20 ratio into Train and Test. We then split the 80% Train portion again, in an 80:20 ratio, into Train and Validate for our Machine Learning models.
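The two-stage split can be sketched as follows. This is a minimal illustration on a dummy dataframe, assuming scikit-learn's `train_test_split`; the dataframe, variable names and `random_state` seed are hypothetical and not taken from the notebooks.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the IMDB 5000 dataframe (100 dummy rows).
movies = pd.DataFrame({"imdb_score": [5.0 + (i % 50) / 10 for i in range(100)]})

# First split: 80% train, 20% test. The seed is an arbitrary choice.
train_full, test = train_test_split(movies, test_size=0.2, random_state=42)

# Second split: carve a validation set out of the 80% train portion.
train, validate = train_test_split(train_full, test_size=0.2, random_state=42)

print(len(train), len(validate), len(test))  # 64 16 20
```

With this scheme the final shares of the full dataset are roughly 64% Train, 16% Validate and 20% Test.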

Cleaning the Train Dataset

  1. Issue with gross
    • We found that gross was not standardised: the dataset mixed different types of gross across movies (e.g. opening-week gross, US & Canada gross).
    • An example of disparities between the types of gross: gross
  2. Issue with budget
    • Movies from different countries reported their budget in different currencies.
    • Since the proportion of movies from other countries (besides the US) was quite small, we decided to drop them.
    • We only used movies from the USA.
    • We then needed to standardise budget based on 2016 inflation rates in the US.
  3. Null Values
    • When the number of null values for a variable was small, we dropped the affected rows.
    • Otherwise, for numerical data, we replaced nulls with the median (for example, before Machine Learning).
    • For categorical data, we dropped the rows.
  4. Train : Validate : Test
    • We followed the Train : Validate : Test scheme
    • Split Train:Test in 80:20 ratio
    • Used Train for our EDA
    • Further split Train into Train:Validate in 80:20 ratio for Machine Learning
  5. Binning imdb_scores
    • We wanted to observe the correlation not just in a numerical manner but also in a categorical manner.
    • Besides, since we couldn't really find any strong linear correlation (as you will read later on), we figured that it would be beneficial to split imdb_score into categories.
  # Bins to categorise the imdb_score ranges

  # Multi bins
  imdb_bins = [0, 3, 5, 7, 10]
  imdb_labels = ["horrendous", "ok", "good", "very good"]

  # Binary bins
  # scores up to 6.5 -> Bad, scores from 6.6 to 10 -> Good
  bins = (2, 6.5, 10)
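One way these bin edges could be applied is with `pandas.cut`. The sketch below is an assumption about usage, not the notebooks' exact code: the sample scores are made up, and the "bad"/"good" labels are taken from the comment above.

```python
import pandas as pd

# Hypothetical scores used only to illustrate the binning.
scores = pd.Series([2.5, 4.8, 6.5, 6.6, 8.9], name="imdb_score")

imdb_bins = [0, 3, 5, 7, 10]
imdb_labels = ["horrendous", "ok", "good", "very good"]

# pd.cut uses right-inclusive intervals by default: (0, 3], (3, 5], (5, 7], (7, 10].
score_cat = pd.cut(scores, bins=imdb_bins, labels=imdb_labels)

# Binary split: (2, 6.5] becomes "bad" and (6.5, 10] becomes "good".
score_binary = pd.cut(scores, bins=[2, 6.5, 10], labels=["bad", "good"])
```

Because the intervals are closed on the right, a score of exactly 6.5 lands in "bad", which matches the 6.5 / 6.6 boundary described in the comment.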

TLDR:

  • We only used movies from the US, and standardised the budget based on 2016 inflation rates.
  • We used the Train : Validate : Test scheme.
  • We removed gross entirely due to inconsistency.
  • We binned the imdb_score into categories, and tried out different bins.
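The cleaning rules summarised above (US movies only, no gross, nulls dropped or replaced with the median) could look roughly like this in pandas. The miniature dataframe and its values are hypothetical, and the choice of which columns count as "categorical" or "numerical" here is only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; all values are made up.
raw = pd.DataFrame({
    "country": ["USA", "USA", "UK", "USA", "USA"],
    "content_rating": ["PG-13", None, "R", "R", "PG"],
    "duration": [120.0, 95.0, 110.0, np.nan, 101.0],
    "gross": [1e6, 2e6, np.nan, 3e6, 4e6],
    "imdb_score": [7.1, 5.4, 6.8, 6.0, 8.2],
})

# Keep US movies only and drop the inconsistent gross column.
clean = raw[raw["country"] == "USA"].drop(columns=["gross"])

# Categorical nulls: drop the affected rows.
clean = clean.dropna(subset=["content_rating"])

# Numerical nulls: fill with the column median.
clean["duration"] = clean["duration"].fillna(clean["duration"].median())
```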

Step 3: EDA

In this section, we will look at univariate and bivariate EDA for the more significant or interesting variables.

Choosing our response variable

We chose imdb_score as our main response variable, for simplicity. Initially, we wanted to use gross, but due to the disparities in how it was recorded, we decided not to.

1. director_name

These are the directors who appear most frequently.

director_name count
Steven Spielberg 22
Woody Allen 18
Clint Eastwood 17
Spike Lee 15
Ridley Scott 15
Martin Scorsese 15
Steven Soderbergh 12
Renny Harlin 12
Robert Zemeckis 12
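A frequency table like the one above can be produced with `value_counts`. The short series below is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample; the real director_name column has thousands of rows.
directors = pd.Series(
    ["Steven Spielberg", "Woody Allen", "Steven Spielberg",
     "Spike Lee", "Steven Spielberg", "Woody Allen"],
    name="director_name",
)

# Counts per director, most frequent first; head(9) mirrors the table length above.
top_directors = directors.value_counts().head(9)
```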

It is interesting to note that Steven Spielberg is also one of the directors behind the Top 20 performing movies.

2. num_critic_for_reviews

  • A large proportion of movies receive close to 0 num_critic_for_reviews.
  • There is no significant linear correlation between num_critic_for_reviews and imdb_score.
  • The table below shows the movies sorted by num_critic_for_reviews. It does seem to show that imdb_score is above 7.0 for the top 20 movies.

    

  • To be fair, num_critic_for_reviews may still give some indication of imdb_score (as the table shows). We may have been unable to observe a linear correlation because such a large proportion of movies receive close to 0 reviews.
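The two checks described in this section, a linear correlation and a look at the most-reviewed movies, can be sketched as below. The data is made up to mimic the shape described (many movies with few reviews, a handful with many) and does not reproduce the project's numbers.

```python
import pandas as pd

# Hypothetical values: most movies have few critic reviews, a few have many.
sample = pd.DataFrame({
    "num_critic_for_reviews": [1, 2, 3, 5, 8, 400, 650, 800],
    "imdb_score": [4.1, 7.9, 6.2, 5.5, 7.0, 7.4, 8.1, 8.5],
})

# Pearson correlation (the pandas default) between review count and score.
pearson_r = sample["num_critic_for_reviews"].corr(sample["imdb_score"])

# The most-reviewed movies, for inspecting their score range directly.
most_reviewed = sample.sort_values("num_critic_for_reviews", ascending=False).head(3)
```

Looking at the sorted head alongside the coefficient helps when the bulk of the data sits near zero and may mask a pattern in the tail.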

3. duration

duration vs imdb_score

  • We binned the imdb_score into categories to form score_cat
  • Based on the boxplot, there seems to be a slight association between duration and score_cat

duration vs score

4. director_facebook_likes

  • The distribution was extremely right-skewed even after removing the outliers. This is not unexpected, since "success" depends on outliers.
  • Because of the skew, we used a log transform to visualise the data.
  • Distribution after log transform:
  • dir_facebook_likes_distribution_log
  • The log-transformed distribution is bimodal, suggesting that there may be two different "clusters".

director_facebook_likes vs imdb_score

Although it can't be confirmed that there is a correlation between them, the boxplots show that the median values of imdb_score do vary for the different categories.

dir_likes_vs_imdb_score

However, we note that the "good" and "very good" categories had relatively more outliers with large director_facebook_likes. This could suggest that there is some correlation if we split the data into subgroups (recall that the log-transformed distribution is bimodal).
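A log transform of a heavily right-skewed count can be sketched as follows. Using `np.log1p` rather than a plain log is an assumption made here so that directors with 0 likes remain defined; the values are invented to show two separated groups.

```python
import numpy as np
import pandas as pd

# Hypothetical likes: many small values and a few very large ones.
likes = pd.Series([0, 12, 35, 80, 150, 11000, 14000, 16000],
                  name="director_facebook_likes")

# log1p(x) = log(1 + x), so a value of 0 maps to 0 instead of -inf.
log_likes = np.log1p(likes)

# Histogram counts over 5 equal-width bins of the transformed values.
counts, edges = np.histogram(log_likes, bins=5)
```

An empty bin between two occupied regions in `counts` is the kind of gap that shows up as two peaks in the plotted distribution.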

5. genres

We had to split the strings into individual genres.

genreFreq
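Splitting the genre strings could be done along these lines. The sketch assumes that genres are joined with a "|" separator, as in the Kaggle CSV; the three rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the "|" separator is an assumption about the raw column.
movies = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "genres": ["Action|Drama", "Drama", "Comedy|Drama|Romance"],
    "imdb_score": [6.0, 8.0, 7.0],
})

# Split each string into a list, then expand to one row per (movie, genre) pair.
by_genre = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")

# Frequency of each individual genre.
genre_freq = by_genre["genre"].value_counts()
```

Note that a movie with several genres is counted once per genre, so the frequencies sum to more than the number of movies.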

Observations

  • Most common genre: Drama

  • Is it because it is the most profitable?

  • This formed our hypothesis: assuming that the movie industry follows demand and supply, there is high demand for Dramas, so this genre will be the most popular and have the highest ratings among all genres.

genres vs mean imdb_scores

We calculated the mean imdb_scores for each genre.

The results :

    genre_vs_score              cat_plot

It seems that Film-Noir has the highest mean imdb_score. However, this is misleading: we later found that only 5 Film-Noir movies contributed to this observation.
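Reporting the number of movies next to each genre's mean makes this kind of small-sample effect visible. The sketch below uses invented per-genre rows and a hypothetical minimum-count threshold.

```python
import pandas as pd

# Hypothetical data with one row per (movie, genre) pair.
by_genre = pd.DataFrame({
    "genre": ["Drama", "Drama", "Drama", "Drama", "Film-Noir", "Comedy", "Comedy"],
    "imdb_score": [6.5, 7.0, 5.5, 7.0, 8.0, 6.0, 6.4],
})

# Mean score and movie count per genre, highest mean first.
genre_stats = (
    by_genre.groupby("genre")["imdb_score"]
    .agg(mean_score="mean", n_movies="count")
    .sort_values("mean_score", ascending=False)
)

# Hypothetical cut-off: ignore genres with fewer than 2 movies in this toy data.
reliable = genre_stats[genre_stats["n_movies"] >= 2]
```

In this toy data Film-Noir tops `genre_stats` on the strength of a single movie and disappears from `reliable`.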

6. budget

We decided to use only movies produced in the USA, so we could standardize the budget based on CPI (referenced: https://aarya1995.github.io/)

We performed web scraping using BeautifulSoup to obtain CPI data. Then, we updated the budget column of the whole dataset.

from bs4 import BeautifulSoup
import requests
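Once a CPI value per year is available, the adjustment itself is a ratio: the budget is multiplied by the 2016 CPI and divided by the CPI of the release year. The sketch below skips the scraping step and uses placeholder CPI numbers that are not real values; the `budget_2016` column name is also hypothetical.

```python
import pandas as pd

# Placeholder CPI table keyed by year. These are NOT real CPI figures.
cpi_by_year = pd.Series({1990: 100.0, 2000: 125.0, 2016: 200.0})

# Hypothetical movies with nominal budgets in their release year.
movies = pd.DataFrame({
    "title_year": [1990, 2000, 2016],
    "budget": [10_000_000.0, 20_000_000.0, 30_000_000.0],
})

# budget_2016 = budget * CPI(2016) / CPI(release year)
cpi_at_release = movies["title_year"].map(cpi_by_year)
movies["budget_2016"] = movies["budget"] * cpi_by_year[2016] / cpi_at_release
```

A movie released in 2016 keeps its budget unchanged, while earlier budgets are scaled up in proportion to how much the index has risen since release.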

budget vs imdb_score

Initially, we couldn't really see any pattern with only 2 or 4 imdb_score bins, so we split into 5 bins and saw a clearer picture. It does seem that a higher budget can influence imdb_score. However, the movies in the "horrendous" category also seem to have had a higher budget.

This could mean that although budget does follow a certain trend as imdb_score increases, we ought to be careful with our budget, as there is still a risk of the movie turning out to be "horrendous".

# the new bins (5 categories) we used 
# bins = [1,3,4,6,9,10], labels = ["horrendous", "very bad", "bad", "ok", "good"]

7. num_voted_users

  • Positively skewed: a large proportion of movies had few or no voted users
  • Not much linear correlation either: the correlation coefficient was 0.470567

            

8. imdb_score

A large proportion of movies have an imdb_score of around 5-8.

The median of imdb_score is 6.5, which is why we chose one of our bins to be [0, 6.6, 10] (i.e. 0-6.5 is classified as "bad" and 6.6-10 as "good").

Multivariate EDA

The heat map shows that some variables affecting imdb_score are:
  • num_critic_for_reviews
  • duration
  • num_voted_users
  • num_user_for_reviews
  • movie_facebook_likes
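The numbers behind such a heat map come from a correlation matrix over the numeric columns. A minimal sketch, using invented values and only a few of the columns:

```python
import pandas as pd

# Hypothetical sample with made-up values; movie_title is non-numeric on purpose.
sample = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 6.5, 7.0, 8.0, 8.5],
    "num_voted_users": [100, 900, 1500, 4000, 20000, 50000],
    "duration": [90, 95, 110, 105, 140, 150],
    "facenumber_in_poster": [3, 0, 5, 1, 2, 4],
    "movie_title": list("abcdef"),
})

# Correlation matrix over numeric columns only (this is what the heat map colours).
corr = sample.select_dtypes(include="number").corr()

# Rank the other variables by absolute correlation with imdb_score.
ranking = corr["imdb_score"].drop("imdb_score").abs().sort_values(ascending=False)
```

Reading off `ranking` gives the same shortlist that one would pick out by eye from the imdb_score row of the heat map.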

Step 4: Machine Learning

We explored several ML models. The best-performing model for our dataset turned out to be ..... Random Forest!

The list of models we used were:

  1. Linear Regression
  2. Logistic Regression
  3. K-Means
  4. Decision Tree
  5. Random Forest (Main)

Linear Regression

  • As expected, since our dataset is highly categorically-inclined, both bivariate and multivariate linear regression (LR) scored poorly on R2 and MSE.
  • Below are the scores of some bivariate LRs that we attempted
  • Multivariate LR
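For reference, the two metrics can be computed as in the sketch below. It uses synthetic data with a deliberately weak linear signal rather than the IMDB dataset, so the numbers only illustrate what a poor linear fit looks like: R2 close to 0 (little variance explained) and an MSE close to the variance of the noise. For MSE, lower is better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (not the IMDB dataset): a weak linear signal buried in noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 6.5 + 0.2 * X[:, 0] + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

r2 = r2_score(y, pred)             # share of variance explained by the model
mse = mean_squared_error(y, pred)  # mean squared residual
```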

Logistic Regression

  • Logistic Regression showed slightly better results, but accuracy scores were still not that high.
  • This implies that there is no clear-cut separation between the classes in the data.
  • Since there was some improvement, a decision tree might show better results.
  • We performed Multivariate and Multiclass Logistic Regression

  • For multivariate logistic regression, scaling improved the accuracy score from 0.51 to 0.66

  • K-folds also improved:

    • multivariate accuracy scores from 0.51 to 0.63
    • multiclass accuracy scores from 0.65 to 0.66 (very slightly)
  • Multiclass logistic regression showed better scores than binomial logistic regression.

  • We also used other metrics like F1 scores and precision to evaluate our model.
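Scaling and K-fold evaluation can be combined in one pipeline, sketched below on synthetic data. `StandardScaler` and 5 folds are assumptions made for this example; the source does not say which scaler or fold count was used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(0)
votes = rng.lognormal(mean=10, sigma=1.5, size=300)  # large, skewed values
duration = rng.normal(loc=110, scale=20, size=300)
X = np.column_stack([votes, duration])
y = (np.log(votes) + rng.normal(scale=1.0, size=300) > 10).astype(int)  # 1 = "good"

# The scaler sits inside the pipeline, so each fold is scaled using
# statistics from its own training part only.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy; cv=5 is an arbitrary choice here.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()
```

Keeping the scaler inside the pipeline avoids leaking information from the held-out fold into the scaling step.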

K-Means

K-means is an unsupervised machine learning model. Using the elbow method, we found that the optimal number of clusters is 3.

The 2-D Grid, Parallel Coordinates Plot and Boxplot all show that budget is a huge determinant in influencing the split between the clusters!
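The elbow method compares the within-cluster sum of squares (inertia) across candidate values of k. A minimal sketch on synthetic points with three obvious groups, not on the movie data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three well-separated groups (not the IMDB data).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [5, 9]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Inertia = within-cluster sum of squared distances; it tends to fall as k grows.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# The "elbow" is the k after which the drop in inertia flattens out.
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 7)}
```

Plotting `inertias` against k and looking for the bend is the usual way to read the result; here the drop after k = 3 is much smaller than the drops before it.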

      

Decision Tree

Train vs Validation results: the decision tree performed relatively well, with 0.7 - 0.83 accuracy. This further suggests that our dataset is highly categorically inclined. However, the train data had slightly better accuracy than the validation data, indicating that there may be slight overfitting.

Nevertheless, since the performance was good, we decided to use the decision tree on our Test dataset. The results are shown below.
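The train-versus-validation comparison used to spot overfitting can be sketched as follows, on synthetic data. The `max_depth` value is an arbitrary illustration and not a setting taken from the notebooks.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data standing in for the movie features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=500) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# max_depth caps tree growth; 4 is an arbitrary illustrative value.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
val_acc = tree.score(X_val, y_val)
gap = train_acc - val_acc  # a large positive gap hints at overfitting
```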

Random Forest

Random Forest was the best! (Although, again, there may be slight overfitting for the same reasons as the decision tree, and it took quite long to load.)

Accuracies of :

  • Train Data = 0.96
  • Validation Data = 0.82
  • Test Data = 0.99

Feature importance in random forest shows how important each feature is in determining the decisions the trees make. Below is the feature importance for determining imdb_score.

feature importance

It turned out that num_voted_users, duration, num_user_for_reviews, num_critic_for_reviews and budget are the top 5 determinants.

It is interesting to note that three of these (num_voted_users, num_user_for_reviews and num_critic_for_reviews) indicate popularity, and it is not unexpected for them to be determinants of success (imdb_score).
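Feature importances can be read from a fitted scikit-learn forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which only one feature drives the label, so the ranking is known in advance; it does not reproduce the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only "num_voted_users" actually determines the label here.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "num_voted_users": rng.normal(size=400),
    "duration": rng.normal(size=400),
    "budget": rng.normal(size=400),
})
y = (X["num_voted_users"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, normalised to sum to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
```

These importances are relative: they rank features against each other within one model and say nothing about the direction of the effect.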

Interesting Questions

Does the movie title length affect imdb scores?

Unfortunately, as much as we wanted to see some correlation, our bivariate EDA tells us that there isn't any. See the boxplot below! However, title length does follow a somewhat normal distribution, with a median length of 13.

distribution of title_length title_length vs imdb_goodbad title_length vs imdb_cat
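A title_length feature can be derived directly from the title strings. The sketch assumes length is measured in characters after stripping surrounding whitespace; the titles are a hypothetical sample.

```python
import pandas as pd

# Hypothetical titles; note the trailing space on the first one.
titles = pd.Series(["Avatar ", "The Godfather", "Up"], name="movie_title")

# Strip surrounding whitespace, then count characters.
title_length = titles.str.strip().str.len()
median_length = title_length.median()
```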

What are the personalities of directors of top performing movies?

We looked up the directors of the Top 20 movies by imdb_score and searched for their personality types online. Here are the results!

| index | director_name | personality type | index | director_name | personality type |
|---|---|---|---|---|---|
| 0 | Frank Darabont | INFP | 10 | David Fincher | INTJ |
| 1 | Francis Ford Coppola | INTJ | 11 | Christopher Nolan | INTJ |
| 2 | John Stockwell | INFP | 12 | Peter Jackson | ENFP |
| 3 | Christopher Nolan | INTJ | 13 | Irvin Kershner | INTP |
| 4 | Francis Ford Coppola | INTJ | 14 | Mitchell Altieri | n/a |
| 5 | Peter Jackson | ENFJ | 15 | Lana Wachowski | ENFP |
| 6 | Sergio Leone | n/a | 16 | Cary Bell | n/a |
| 7 | Steven Spielberg | ISFP | 17 | Fernando Meirelles | INFP |
| 8 | Quentin Tarantino | ENTP | 18 | Milos Forman | INTP |
| 9 | Robert Zemeckis | ENFP | 19 | Akira Kurosawa | INFJ |

Observations: of the directors with a known type, all except one (Steven Spielberg) have "N" in their personality type, which is the intuitive element.

Do you, as a movie director, have these personality traits too?

The Big Conclusion:

  1. Our outcomes show that decision tree and random forest are the most suitable machine learning models for our dataset.
  2. This may be due to our dataset having skewed and imbalanced data. Also, our dataset does not have strong linear relationships.
  3. Duration and budget of the movie are among the top 5 features affecting imdb_score.
  4. Popularity of the director and cast plays a role in determining imdb_score.
  5. The top 3 genres affecting imdb_score are drama, comedy and action. This aligns with our bivariate EDA, as drama is one of the most represented genres.

So a movie director should pay close attention to the aforementioned factors.

Generally, based on our EDA and ML, movies with the following attributes will do better on the imdb rating score:

  • Higher duration
  • Higher budget
  • More popular director and cast
  • Movies with the genres of drama, comedy and/or action

Beyond our Course:

  • Standardising budget to 2016 inflation rate as the latest movies only go up to 2016

  • Web scraping

  • Visualisations:

    • 3D scatter plot & word cloud

          

  • Machine Learning:

    • K-modes & K-means

    • Logistic Regression

      • using Scaler() from sklearn
    • Random Forest

    • Feature Importance

    • Metrics

Limitations and Discussion:

  1. Analysis of the directors' personalities may be biased, because their personality types may have been assigned based on their careers. Therefore, it may not be an accurate representation. However, it is still interesting to note their personalities!
  2. Further analysis can be done on other variables that indicate success through popularity, such as director_facebook_likes, num_critic_for_reviews and num_voted_users.
  3. Our dataset is quite imbalanced and skewed, so a larger dataset may help.

Workload Delegation:

1. Koh Zi En

2. Sandhiya Sukumaran

  • ML : Random Forest, Linear Regression
  • Presentation
  • EDA
  • Codes for EDA : EDA on mid 9 Variables
  • Data Visualisation

3. Yap Shen Hwei

Our Video:


References: