Investigate TMDb movie data

In this notebook, I investigate movie data from [The Movie Database (TMDb)](https://www.themoviedb.org/).


I have explored three questions:

  • Does a movie's budget relate to its vote average?
  • What percentage of movies made no revenue?
  • Does the average revenue increase over time?


The notebook includes data wrangling and exploratory data analysis.


In data wrangling, I first checked the general information, such as the structure of the data, missing values, and duplicates. Then I cleaned the data by:

  • Finding duplicated rows and dropping them.
  • Dropping columns that are not relevant and have many missing values.
  • Filling in rows that have missing values.
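
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the tiny table, the column names (`original_title`, `budget`, `homepage`, `genres`), the choice of `homepage` as the column to drop, and the `"Unknown"` placeholder are all assumptions made for the example.

```python
import pandas as pd

# Made-up stand-in for the TMDb table; column names are assumptions.
raw = pd.DataFrame({
    "original_title": ["A", "B", "B", "C", "D"],
    "budget": [100, 200, 200, 50, 75],
    "homepage": [None, "http://example.com", "http://example.com", None, None],
    "genres": ["Drama", "Comedy", "Comedy", None, "Action"],
})

# 1. Count fully duplicated rows, then drop them.
n_duplicates = int(raw.duplicated().sum())
clean = raw.drop_duplicates().reset_index(drop=True)

# 2. Drop a column that is mostly missing and not needed (hypothetical choice).
clean = clean.drop(columns=["homepage"])

# 3. Fill the remaining missing values with a placeholder label.
clean["genres"] = clean["genres"].fillna("Unknown")

print(n_duplicates)
print(clean)
```

Calling `raw.info()` and `raw.isna().sum()` before cleaning is one way to get the general overview of structure and missing values mentioned above.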


In the exploratory data analysis, I did the following:

To answer the first question "Does a movie's budget relate to its vote average?",

  • Use a scatter plot to explore the relationship between budget and average vote.
  • Divide the votes into three bins, and use a bar chart to visualize the budget in each category.
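
A rough sketch of this analysis, using made-up numbers. The column names `budget` and `vote_average`, the equal-width binning with `pd.cut`, and the labels `low`/`medium`/`high` are assumptions; the notebook may bin the votes differently (for example by quantiles).

```python
import pandas as pd

# Made-up sample; column names are assumptions.
movies = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50, 60],
    "vote_average": [4.0, 5.0, 6.0, 6.5, 7.5, 8.0],
})

# Numeric counterpart of the scatter plot: Pearson correlation.
corr = movies["budget"].corr(movies["vote_average"])

# Split the vote average into three equal-width bins (assumed binning).
movies["vote_level"] = pd.cut(
    movies["vote_average"], bins=3, labels=["low", "medium", "high"]
)

# Mean budget per vote category; this is what the bar chart would show.
mean_budget = movies.groupby("vote_level", observed=True)["budget"].mean()

print(round(corr, 3))
print(mean_budget)
```

With matplotlib installed, `movies.plot.scatter(x="budget", y="vote_average")` and `mean_budget.plot.bar()` would draw the two charts.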

To answer the second question "What percentage of movies made no revenue?",

  • Separate the dataset into movies with positive revenue and movies with no revenue.
  • Visualize the percentage of movies with no revenue versus positive revenue using a pie chart.
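
The split can be sketched like this, under the assumption that a recorded `revenue` of 0 means "no revenue" (the column name and the sample values are made up for illustration).

```python
import pandas as pd

# Made-up sample; a revenue of 0 is assumed to mean "no revenue".
movies = pd.DataFrame({"revenue": [0, 0, 0, 120, 300, 45, 80, 10]})

# Label each movie by whether its revenue is positive.
has_revenue = movies["revenue"] > 0
labels = has_revenue.map({True: "positive revenue", False: "no revenue"})

# Share of each group in percent; these are the slices of the pie chart.
shares = labels.value_counts(normalize=True) * 100

print(shares)
```

With matplotlib installed, `shares.plot.pie(autopct="%.1f%%")` would render the pie chart.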

To answer the third question "Does the average revenue increase over time?",

  • Plot average revenue over time and add a regression line to see the trend.
  • Plot the revenue of each movie over time to explore the reason for the pattern in the previous graph.
  • Based on my assumption, plot the total revenue and the number of movies made each year to confirm the assumption.
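
One possible way to compute the quantities behind these plots is shown below. The column names `release_year` and `revenue` and the sample values are assumptions, and the straight-line fit with `np.polyfit` stands in for whatever regression the notebook actually uses.

```python
import numpy as np
import pandas as pd

# Made-up sample; column names are assumptions.
movies = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001, 2002, 2002, 2002],
    "revenue": [100, 300, 200, 400, 300, 500, 700],
})

# Average revenue, total revenue, and number of movies per year.
per_year = movies.groupby("release_year")["revenue"].agg(["mean", "sum", "count"])

# Fit a straight line to the yearly averages (degree-1 least squares).
years = per_year.index.to_numpy()
slope, intercept = np.polyfit(years, per_year["mean"].to_numpy(), deg=1)
per_year["trend"] = slope * years + intercept

print(per_year)
print("slope per year:", round(slope, 2))
```

A positive `slope` suggests rising average revenue. Comparing the `sum` and `count` columns shows whether a change in the average comes from total revenue or from the number of movies released each year.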