How to Maximise Movie Success - Project Overview

Contributors: tripleH group

Sandhiya Sukumaran @sandhiyaaa
Koh Zhi En @zex3
Yap Shen Hwei @imaginaryBuddy

Codes are in:

Our project is based on the IMDB 5000 dataset found on Kaggle.

Content Section

Introduction
- Problem Statement
- Motivation
Steps:
- 1 : Looking at the Dataset
  - Our Hypotheses
- 2 : Data Extraction & Cleaning
- 3 : Exploratory Data Analysis
- 4 : Machine Learning
Interesting Questions
The Big Conclusion
Beyond Our Course
Limitations and Discussion
Workload Delegation
Our Video
References

Introduction

Have you ever wondered why some movies are more successful than others? If you're a movie director, you've come to the right place! If you are not a movie director, you can of course still read on to find out more!

Problem Statement

Identify which features contribute to the success of a movie.

Motivation

Give directors a better estimate of how to maximize the success of their movies.

Step 1: Looking at the Dataset

variable	dtype	variable	dtype
color	object	actor_3_name	object
director_name	object	facenumber_in_poster	float64
num_critic_for_reviews	float64	plot_keywords	object
duration	float64	movie_imdb_link	object
director_facebook_likes	float64	num_user_for_reviews	float64
actor_3_facebook_likes	float64	language	object
actor_2_name	object	country	object
actor_1_facebook_likes	float64	content_rating	object
gross	float64	budget	float64
genres	object	title_year	float64
actor_1_name	object	actor_2_facebook_likes	float64
movie_title	object	imdb_score	float64
num_voted_users	int64	aspect_ratio	float64
cast_total_facebook_likes	int64	movie_facebook_likes	int64

There are a total of 28 variables.

Our hypotheses :

Duration will not affect IMDB scores
Variables related to popularity will have positive correlation with IMDB score
Budget will affect IMDB score

Step 2: Data Extraction and Data Cleaning

Train:Test:Validation

We split our dataset into Train and Test sets in an 80:20 ratio, then further split the 80% Train portion into Train and Validate sets, again in an 80:20 ratio, for our Machine Learning models.
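The two-stage split can be sketched as follows. This is a minimal illustration that assumes scikit-learn's train_test_split; the small DataFrame and the random_state value are placeholders, not the project's actual data or settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the movie table (values are illustrative only)
movies = pd.DataFrame({
    "duration": range(100),
    "imdb_score": [5.0 + (i % 50) / 10 for i in range(100)],
})

# First split: 80% for train + validate, 20% held out as the test set
train_full, test = train_test_split(movies, test_size=0.2, random_state=42)

# Second split: divide the 80% again into 80% train and 20% validate
train, validate = train_test_split(train_full, test_size=0.2, random_state=42)

print(len(train), len(validate), len(test))
```

Splitting twice this way leaves roughly 64% of the rows for training, 16% for validation and 20% for testing.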

Cleaning the Train Dataset

Issue with gross
- We found out that gross was not standardized, as the dataset contained different types of gross for each movie. (e.g: opening week gross, US&Canada Gross etc.)
- An example of disparities between the types of gross:
Issue with budget
- Different movies from different countries had different currencies for their budget.
- Since the proportion of movies from other countries (besides US) was quite small, we decided to drop them.
- We only used movies from USA.
- We needed to standardize budget based on 2016 inflation rates in the US.
Null Values
- When the number of null values for a variable was small, we dropped those rows.
- Otherwise, for numerical data, we replaced the null values with the median where needed, for example during Machine Learning.
- For categorical data, we dropped the rows.
Train : Validate : Test
- We followed the Train : Validate : Test scheme
- Split Train:Test in 80:20 ratio
- Used Train for our EDA
- Further split Train into Train:Validate in 80:20 ratio for Machine Learning
Binning imdb_score
- We wanted to observe the correlation not just in a numerical manner but also in a categorical manner.
- Besides, since we couldn't really find any strong linear correlation (as you will read later on), we figured that it would be beneficial to split imdb_score into categories.

  # Bins to categorise the imdb_score ranges

  # Multi bins
  imdb_bins = [0, 3, 5, 7, 10]
  imdb_labels = ["horrendous", "ok", "good", "very good"]

  # Binary bins
  # 6.5 = 1-6.5 (Bad) 10 = 6.6-10 (Good)
  bins = (2, 6.5, 10)
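As a sketch of how these bins could be applied, the snippet below uses pandas.cut on a handful of made-up scores. Using pd.cut is an assumption about the implementation, and the "bad"/"good" labels for the binary bins are taken from the comment above.

```python
import pandas as pd

# Made-up scores, for illustration only
scores = pd.Series([2.5, 4.8, 6.5, 6.6, 9.1], name="imdb_score")

# Multi bins: intervals are right-inclusive, i.e. (0, 3], (3, 5], (5, 7], (7, 10]
imdb_bins = [0, 3, 5, 7, 10]
imdb_labels = ["horrendous", "ok", "good", "very good"]
score_cat = pd.cut(scores, bins=imdb_bins, labels=imdb_labels)

# Binary bins: (2, 6.5] is "bad" and (6.5, 10] is "good"
binary_cat = pd.cut(scores, bins=[2, 6.5, 10], labels=["bad", "good"])

print(pd.DataFrame({"score": scores, "multi": score_cat, "binary": binary_cat}))
```

Because the intervals are closed on the right, a score of exactly 6.5 falls in the "bad" bin and 6.6 in the "good" bin.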

TLDR:

We only used movies from the US, and standardised the budget based on 2016 inflation rates.

We used the Train : Validate : Test scheme.

We removed gross entirely due to inconsistency.

We binned the imdb_score into categories, and tried out different bins.

Step 3: EDA

In this section, we will look at univariate and bivariate EDAs concerning more significant/ interesting variables.

Choosing our response variable

We have chosen imdb_score as our main response variable, for simplicity. Initially, we wanted to use gross, but due to the disparities described above, we decided not to.

1. director_name

These are the most frequently appearing directors.

director_name	count
Steven Spielberg	22
Woody Allen	18
Clint Eastwood	17
Spike Lee	15
Ridley Scott	15
Martin Scorsese	15
Steven Soderbergh	12
Renny Harlin	12
Robert Zemeckis	12

It is interesting to note that Steven Spielberg is also one of the directors of the Top20 performing movies.

2. num_critic_for_reviews

A large proportion of movies receive close to 0 num_critic_for_reviews.
There is no significant linear correlation between num_critic_for_reviews and imdb_score.
The table below shows the movies sorted by num_critic_for_reviews; for these 20 movies, imdb_score does seem to fall in a range above 7.0.

To be fair, num_critic_for_reviews may give some indication of imdb_score (as the table suggests). We may have been unable to observe a linear correlation because such a large proportion of the data receives close to 0 reviews.

3. duration

duration vs imdb_score

We binned the imdb_score into categories to form score_cat
There seems to be slight correlation based on the boxplot between duration and score_cat

4. director_facebook_likes

The distribution was extremely right-skewed even after removing the outliers, which is not unexpected, since "success" depends on outliers.

Due to the skewed structure, we used a log transform to visualise the data.
Distribution after log transform:
Bimodal distribution, suggesting that there may be two different "clusters".
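A log transform of this kind can be sketched as below. The like counts are made up, and np.log1p (which computes log(1 + x) and so handles zero likes) is an assumption; the project may have used a different log variant.

```python
import numpy as np
import pandas as pd

# Made-up like counts with a long right tail, for illustration only
likes = pd.Series([0, 12, 150, 3000, 22000], name="director_facebook_likes")

# log1p(x) = log(1 + x), so directors with 0 likes map to 0 instead of -inf
log_likes = np.log1p(likes)

print("skew before:", likes.skew())
print("skew after: ", log_likes.skew())
```

The transform compresses the large values far more than the small ones, which is what makes the shape of the distribution visible in a histogram.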

director_facebook_likes vs imdb_score

Although it can't be confirmed that there is a correlation between them, the boxplots show that the median values of imdb_score do vary across the different categories.

However, we note that the "good" and "very good" categories had relatively more outliers with larger director_facebook_likes. This could suggest that there is some correlation if we split them into subgroups to observe (recall the bimodal distribution above).

5. genres

We had to split the strings into individual genres.
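A minimal sketch of this splitting step is shown below. It assumes the genres column separates genres with a "|" character and uses a tiny made-up table. Reporting the movie count next to the mean score helps flag genres that are represented by only a few movies.

```python
import pandas as pd

# Made-up rows, for illustration only; "|" as separator is an assumption
movies = pd.DataFrame({
    "genres": ["Drama|Romance", "Action|Drama", "Comedy", "Film-Noir|Drama"],
    "imdb_score": [7.0, 6.0, 5.0, 8.0],
})

# One row per (movie, genre) pair
per_genre = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")

# Count and mean imdb_score per genre, most common genre first
summary = (
    per_genre.groupby("genre")["imdb_score"]
    .agg(["count", "mean"])
    .sort_values("count", ascending=False)
)
print(summary)
```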

Observations

Most common genre : Drama
Is it because it is the most profitable?
This formed our hypothesis: assuming that the movie industry follows demand and supply, there is high demand for Dramas, so this genre will be the most popular and have the highest ratings among the genres.

genres vs mean imdb_scores

We calculated the mean imdb_scores for each genre.

The results :

It seems that Film-Noir has the highest mean imdb_score. However, this is misleading: we later found that only 5 Film-Noir movies contributed to this observation. As noted here:

6. budget

We decided to use only movies produced in the USA, so we could standardize the budget based on CPI (referenced: https://aarya1995.github.io/)

We performed web scraping using BeautifulSoup to obtain CPI data. Then, we updated the budget column of the whole dataset.

from bs4 import BeautifulSoup
import requests
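Once CPI values are available per year, the adjustment itself is a simple ratio: budget_2016 = budget × CPI_2016 / CPI_title_year. The sketch below assumes this standard formula. The CPI numbers are hypothetical placeholders (not real CPI figures) standing in for the scraped table, and the budget_2016 column name is ours.

```python
import pandas as pd

# HYPOTHETICAL CPI values, for illustration only; real values would come
# from the scraped CPI table
cpi_by_year = {1990: 100.0, 2000: 150.0, 2016: 200.0}

# Made-up movies, for illustration only
movies = pd.DataFrame({
    "title_year": [1990, 2000, 2016],
    "budget": [10_000_000.0, 30_000_000.0, 50_000_000.0],
})

# Look up the CPI of each movie's release year
cpi = movies["title_year"].map(cpi_by_year)

# Express every budget in 2016 dollars
movies["budget_2016"] = movies["budget"] * cpi_by_year[2016] / cpi
print(movies)
```

A movie released in 2016 keeps its budget unchanged, while older budgets are scaled up by the ratio of the two CPI values.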

budget vs imdb_score

Initially, we couldn't really see any pattern with only 2 or 4 imdb_score bins, so we split into 5 bins and saw a clearer picture. It does seem that a higher budget can influence imdb_score. However, the budget spent on movies in the "horrendous" category also seems to be higher.

This could mean that although budget does follow a certain trend as imdb_score increases, we ought to be careful with our budget, as there is still a risk of the movie turning out to be "horrendous".

# the new bins (5 categories) we used 
# bins = [1,3,4,6,9,10], labels = ["horrendous", "very bad", "bad", "ok", "good"]

7. num_voted_users

Positively skewed; a large proportion of movies had few or no voted users.
Not much linear correlation either, with a correlation coefficient of 0.470567.

8. imdb_score

A large proportion of movies have an imdb_score of around 5-8.

The median of imdb_score is 6.5, which is why we chose one of our bins to be [0, 6.6, 10] (i.e. 0-6.5 will be classified as "bad" and 6.6-10 as "good").

Multivariate EDA

The heat map shows that some variables affecting imdb_score are:

num_critic_for_reviews
duration
num_voted_users
num_user_for_reviews
movie_facebook_likes
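The numbers behind such a heat map come from a pairwise correlation matrix. The sketch below builds one with pandas on a tiny made-up table and ranks the variables by the strength of their correlation with imdb_score; the resulting matrix could then be passed to a plotting function such as seaborn.heatmap.

```python
import pandas as pd

# Made-up numeric columns, for illustration only
movies = pd.DataFrame({
    "imdb_score": [5.0, 6.0, 7.0, 8.0, 9.0],
    "duration": [90, 100, 95, 120, 130],
    "num_voted_users": [100, 500, 2000, 9000, 40000],
})

# Pearson correlation between every pair of columns
corr = movies.corr()

# Correlation of each variable with imdb_score, strongest (by magnitude) first
target_corr = corr["imdb_score"].drop("imdb_score")
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```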

Step 4: Machine Learning

We explored several ML models. The best-performing model for our dataset turned out to be ..... Random Forest!

The models we used were:

Linear Regression
Logistic Regression
K-Means
Decision Tree
Random Forest (Main)

Linear Regression

As expected, since our dataset is highly categorically-inclined, both bivariate and multivariate linear regression (LR) had low R² scores and poor MSE.
Below are the scores of some of the bivariate LRs that we attempted:
Multivariate LR

Logistic Regression

Logistic Regression showed slightly better results. However, the accuracy scores were not that high either.
This implies that there is no "clear cut" boundary between the classes in the data.
Since there was some improvement, we thought a decision tree might show better results.
We performed Multivariate and Multiclass Logistic Regression

For multivariate logR, scaling improved the accuracy score from 0.51 to 0.66
K-folds also improved:
- multivariate accuracy scores from 0.51 to 0.63
- multiclass accuracy scores from 0.65 to 0.66 (very slightly)
Multiclass logRegression showed better scores than binomial logRegression.
We also used other metrics like F1 scores and precision to observe our model.
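Scaling and K-fold evaluation can be combined as in the sketch below. It is only an illustration on synthetic data: StandardScaler, the pipeline layout and the choice of 5 folds are assumptions, not necessarily what the project used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, for illustration only
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(100, 20, 200), rng.normal(1e6, 5e5, 200)])
y = (X[:, 0] + X[:, 1] / 2.5e4 > 140).astype(int)

# Putting the scaler inside the pipeline means it is re-fitted on the
# training part of every fold, so no information leaks from the held-out part
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Passing scoring="f1" or scoring="precision" instead of "accuracy" would report the other metrics mentioned above for a binary target.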

K-Means

K-means is an unsupervised machine learning model. We found that the optimal number of clusters is 3 (using the elbow method).
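The elbow method fits K-means for a range of cluster counts and looks for the point where the within-cluster sum of squares (inertia) stops dropping sharply. The sketch below runs it on synthetic data with three well-separated blobs; the data, the range of k and the KMeans settings are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic 2-D blobs, for illustration only
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [20, 0]])
X = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in centers])

# Fit K-means for k = 1..6 and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting inertia against k would show a sharp bend at k = 3 for this synthetic data, which is the "elbow".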

The 2-D Grid, Parallel Coordinates Plot and Boxplot all show that budget is a huge determinant in influencing the split between the clusters!

Decision Tree

Train vs Validation results: the model had relatively good performance, with accuracies of 0.7 - 0.83. This further confirms that our dataset is highly categorically inclined. However, the train data had slightly better accuracy than the validation data, indicating that there may be slight overfitting.
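Comparing train and validation accuracy, as described above, can be sketched like this. The data is synthetic and the max_depth value is an arbitrary assumption; a train accuracy noticeably above the validation accuracy is the overfitting signal.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only the first feature carries signal, plus label noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Limiting the depth is one common way to reduce overfitting
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
val_acc = accuracy_score(y_val, tree.predict(X_val))
print(f"train={train_acc:.2f} validation={val_acc:.2f} gap={train_acc - val_acc:.2f}")
```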

Nevertheless, since the performance was good, we decided to use the decision tree on our Test dataset. The results are shown below.

Random Forest

Random Forest was the best! (Although, again, there may be slight overfitting for the same reasons as the decision tree, and it took quite long to run.)

Accuracies of :

Train Data = 0.96
Validation Data = 0.82
Test Data = 0.99

Feature importance in random forest shows how important each feature is in determining the decisions the trees make. Below is the feature importance for determining imdb_score.

It turned out that num_voted_users, duration, num_user_for_reviews, num_critic_for_reviews and budget are the top 5 determinants.

It is interesting to note that three of these (num_voted_users, num_user_for_reviews and num_critic_for_reviews) indicate popularity, so it is not unexpected for them to be determinants of success (imdb_score).
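Feature importances can be read from a fitted scikit-learn forest as sketched below. The data is synthetic: the column names are borrowed from the dataset purely for readability, "noise" is a hypothetical irrelevant feature, and the target is constructed so that num_voted_users matters most.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features, for illustration only
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "num_voted_users": rng.normal(size=n),
    "duration": rng.normal(size=n),
    "noise": rng.normal(size=n),  # hypothetical feature unrelated to the target
})
y = (X["num_voted_users"] + 0.5 * X["duration"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across all features
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

These importances are impurity-based, so they describe how much each feature helped the trees split the training data rather than a causal effect on the score.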

Interesting Questions

Does the movie title length affect imdb scores?

Unfortunately, as much as we wanted to see some correlation, our bivariate EDA tells us that there isn't any. See the boxplot below! However, the title lengths do follow a somewhat normal distribution, with a median length of 13.

What are the personalities of directors of top performing movies?

We looked into the Top20 imdb_score movies and searched online for the personality types of their directors. Here are the results!

index	director_name	personality type	index	director_name	personality type
0	Frank Darabont	INFP	10	David Fincher	INTJ
1	Francis Ford Coppola	INTJ	11	Christopher Nolan	INTJ
2	John Stockwell	INFP	12	Peter Jackson	ENFP
3	Christopher Nolan	INTJ	13	Irvin Kershner	INTP
4	Francis Ford Coppola	INTJ	14	Mitchell Altieri	n/a
5	Peter Jackson	ENFJ	15	Lana Wachowski	ENFP
6	Sergio Leone	n/a	16	Cary Bell	n/a
7	Steven Spielberg	ISFP	17	Fernando Meirelles	INFP
8	Quentin Tarantino	ENTP	18	Milos Forman	INTP
9	Robert Zemeckis	ENFP	19	Akira Kurosawa	INFJ

Observations: almost all of them (except for one - Steven Spielberg) have "N" in their personalities, which is the intuitive element.

Do you, as a movie director, have these personality traits too?

The Big Conclusion:

Our outcomes show that decision tree and random forest are the most suitable machine learning models for our data set.
This may be due to our dataset having skewed and imbalanced data. Also, our dataset does not have very strong linear relationships.
Duration and budget of the movie are among the top 5 features affecting imdb score.
Popularity of the director and cast plays a role in determining imdb score.
The top 3 genres affecting imdb score are drama, comedy and action. This aligns with our bivariate EDA, as drama is one of the most represented genres affecting imdb_score.

So a movie director should pay close attention to the aforementioned factors.

Generally, based on our EDA and ML, movies with the following attributes will do better on the imdb rating score:

Higher duration
Higher budget
More popular director and cast
Movies with the genres of drama, comedy and/or action

Beyond our Course:

Standardising budget to 2016 inflation rate as the latest movies only go up to 2016
Web scraping
Visualisations:
- 3D scatter plot & word cloud
Machine Learning:
K-modes & K-means
Logistic Regression
- using Scaler() from sklearn
Random Forest
Feature Importance
Metrics

Limitations and Discussion:

Analysis of personalities of the directors may be biased because they may be classified as those personalities based on their careers. Therefore, it may not be an accurate representation. However, it is still interesting to note their personalities!
Further analysis can be done on other variables that indicate success through popularity, such as director_facebook_likes, num_critic_for_reviews and num_voted_users.
Our dataset is quite imbalanced and skewed, so a larger dataset may help.

Workload Delegation:

1. Koh Zi En

ML : KMeans, Decision Tree
Presentation
EDA
Codes for EDA : EDA on last 9 Variables
Data Visualisation

2. Sandhiya Sukumaran

ML : Random Forest, Linear Regression
Presentation
EDA
Codes for EDA : EDA on mid 9 Variables
Data Visualisation

3. Yap Shen Hwei

ML : Logistic Regression
Presentation
EDA
Codes for EDA : EDA on first 9 Variables
Github
Answering Interesting Questions : Codes here

imaginaryBuddy/imdbMoviesDSAI