Diversity in the film industry has attracted growing attention in recent years and has sparked considerable debate. On the one hand, ethnic minorities may feel underrepresented in the film industry. On the other hand, some people claim that diversity is being forced when people of color are cast in roles previously deemed "white" in famous earlier depictions, for example in the recent live-action remake of 'The Little Mermaid'. Essentially, the film sector serves as a mirror of our culture. People are typically influenced by what they see every day, and entertainment is among the media they consume most, so diversity and representation are especially important there. When viewers can recognize themselves and their experiences on screen, it fosters a sense of connection and understanding, which is why representation in terms of both ethnicity and gender matters. It is therefore worthwhile to analyze whether and how actor diversity and gender representation have changed over the years within the film industry.
This project is aimed at the following research question:
'How does the movie release year influence the actor diversity, measured by the Shannon index, and gender representation in the film industry?'
From the IMDb Non-Commercial Datasets, the following datasets were used:
- name.basics.tsv.gz
- title.basics.tsv.gz
Additionally, the Harvard Actor Racial Line Dataset was used.
The dataset in its entirety consists of 31 variables. However, for this analysis a selection of 3 relevant variables will be used, namely:
Variables | Description |
---|---|
CHARACTER_RACE | The ethnicity of the actor |
GENDER | The gender of the actor |
startYear | The release year of the movie |
The Shannon index measures diversity; in this research it is based on ethnicity and is calculated from the 'CHARACTER_RACE' variable in the dataset.
The formula to calculate the Shannon index is:

$$H = -\sum_{i=1}^{S} p_i \ln p_i$$

Where:
- $H$ is the Shannon index, which represents the actor diversity.
- $S$ is the number of different categories.
- $p_i$ is the proportion of the total occurrences that belong to the $i$-th category, calculated as the share of each "CHARACTER_RACE" category in every year.
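As an illustration of the calculation, the following is a minimal sketch in base R. The toy data frame `cast` and the helper `shannon_index()` are hypothetical and are not taken from the project's scripts; only the column names `startYear` and `CHARACTER_RACE` come from the variables listed above. The natural logarithm is assumed.

```r
# Hypothetical toy data; column names mirror the variables used in this project.
cast <- data.frame(
  startYear      = c(2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001),
  CHARACTER_RACE = c("White", "White", "White", "White",
                     "White", "Black", "Asian", "Latino")
)

# Shannon index H = -sum(p_i * ln(p_i)) for one vector of category labels.
# (Hypothetical helper, assuming the natural logarithm.)
shannon_index <- function(categories) {
  p <- prop.table(table(categories))  # proportion of each category
  p <- p[p > 0]                       # drop empty categories to avoid 0 * log(0)
  -sum(p * log(p))
}

# One index value per release year.
shannon_by_year <- tapply(cast$CHARACTER_RACE, cast$startYear, shannon_index)
print(shannon_by_year)
```

In this toy example the year with a single category gets an index of 0, while the year with four equally frequent categories gets ln(4), the maximum possible value for four categories.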
Regression analysis is used to reach a conclusion, because it shows how strongly the release year is associated with actor diversity and gender representation. We perform two separate analyses to investigate the link between the independent variable startYear and the two dependent variables, namely 'Shannon index' and 'gender'. This statistical method allows us to measure the influence of the release year and assess whether it is an important predictor. This approach makes it possible to gain data-driven insights into the diversity of the film industry.
The formula for the linear regression is:

$$Y = \beta_0 + \beta_1 X_1 + \varepsilon$$

Where:
- $Y$ is the dependent variable, which is either the Shannon index or female gender representation.
- $\beta_0$ is the intercept.
- $\beta_1$ is the coefficient for the independent variable (startYear).
- $X_1$ is the value of the independent variable (startYear).
- $\varepsilon$ is the error term.
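A regression of this form can be sketched with `lm()` from base R. The data frame `yearly` and its values are invented purely for illustration and are not results from this project; the column name `shannon` is likewise an assumption.

```r
# Hypothetical yearly values for illustration only; not results from this project.
yearly <- data.frame(
  startYear = 2000:2009,
  shannon   = c(0.50, 0.53, 0.51, 0.56, 0.58, 0.57, 0.61, 0.63, 0.62, 0.66)
)

# Fit Y = beta0 + beta1 * X1 + error by ordinary least squares.
model <- lm(shannon ~ startYear, data = yearly)

# Coefficient table: estimates, standard errors, t values and p values.
coefs <- summary(model)$coefficients
print(coefs)

# Slope for startYear and its p value.
beta1   <- coefs["startYear", "Estimate"]
p_value <- coefs["startYear", "Pr(>|t|)"]
```

The second analysis would follow the same pattern with the gender outcome on the left-hand side of the model formula.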
The results of the regression analysis show that the release year of a movie has a significant positive effect on the Shannon index, and therefore on the racial diversity of actors. However, the yearly increase in diversity is minimal.
The results of the regression analysis also show that the release year of a movie has a significant positive effect on female representation. However, the current gap between male and female actors is still considerably large.
├── data
├── gen
│   ├── analysis
│   ├── data-preparation
│   └── paper
├── src
│   ├── analysis
│   ├── data-preparation
│   └── paper
├── .gitignore
├── README.md
└── makefile
- LaTeX
To run our project on Windows, ensure LaTeX is installed using the following instructions:
  - For general R users: set up LaTeX by running `tinytex::install_tinytex()` in R.
  - For RStudio users: set up LaTeX by running `tinytex::install_tinytex()` in the RStudio console.
- Pandoc
To run our Makefile on all computers, install Pandoc from https://pandoc.org/installing.html following the instructions for your specific operating system.
- XQuartz
To run our Makefile on macOS, install XQuartz from https://www.xquartz.org, then follow the on-screen instructions.
The makefile runs an R script which automatically installs all missing packages in R. Our project depends on the following packages:
library(tidyverse)
library(readr)
library(dplyr)
library(tibble)
library(stringr)
library(ggplot2)
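For readers who want to check their setup by hand, the following sketch shows one way missing packages can be detected and installed. It is not the project's actual install script; the helpers `find_missing()` and `install_missing()` and the choice of CRAN mirror are assumptions made for illustration.

```r
# Packages the project depends on (as listed above).
required <- c("tidyverse", "readr", "dplyr", "tibble", "stringr", "ggplot2")

# Hypothetical helper: return the packages that are not yet installed.
find_missing <- function(packages) {
  setdiff(packages, rownames(installed.packages()))
}

# Hypothetical helper: install whatever is missing.
# The CRAN mirror is an assumption; any mirror works.
install_missing <- function(packages, repos = "https://cloud.r-project.org") {
  to_install <- find_missing(packages)
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  invisible(to_install)
}

# Report what is missing without installing anything.
print(find_missing(required))
```

Calling `install_missing(required)` would then install only the packages reported as missing.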
Cloning the repository
- Open your terminal (on Mac) or Git Bash (on Windows)
- Set your working directory to the preferred location
- Type
git clone https://github.com/course-dprep/actor_diversity_gender_representation_film_industry
Running the makefile
- Change the working directory of your terminal to
actor_diversity_gender_representation_film_industry
- Type
make
- IMDb Non-Commercial Datasets: https://developer.imdb.com/non-commercial-datasets/
- Harvard Actor Racial Line Dataset: https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/KERZQY/E3ODSJ
This repository was created for the course Data Preparation and Workflow Management taught by Hannes Datta, at the Tilburg School of Economics and Management, as part of the Master's program Marketing Analytics. This repository is maintained by Team 13, which consists of:
- Stefano Greco Barriada
- Stan van Goor
- Corinne Inzirillo
- Arda Reyhan