In this notebook, I investigate movie data from [The Movie Database (TMDb)](https://www.themoviedb.org/).
I have explored three questions:
- Does budget relate to vote average of the movie?
- What percentage of movies made no revenue?
- Does the average revenue increase over time?
The notebook includes data wrangling and exploratory data analysis.
In data wrangling, I first checked general information such as the structure of the data, missing values, and duplicates. Then I cleaned the data by:
- Finding the duplicated rows and dropping them.
- Dropping columns that are not relevant and have many missing values.
- Filling in the remaining missing values.
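
The wrangling steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up DataFrame, not the notebook's actual code: the column names (`title`, `budget`, `revenue`, `homepage`, `genres`), the choice of which column to drop, and the `"Unknown"` fill value are all assumptions made for the example.

```python
import pandas as pd

# Made-up stand-in for the TMDb data; all column names are assumptions.
df = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "budget": [100, 100, 50, 0],
    "revenue": [300, 300, 0, 20],
    "homepage": ["http://example.com", "http://example.com", None, None],
    "genres": ["Drama", "Drama", None, "Comedy"],
})

# 1. Inspect the structure, missing values, and duplicates.
df.info()
print(df.isnull().sum())
print("duplicated rows:", df.duplicated().sum())

# 2. Drop the duplicated rows (keeps the first occurrence).
clean = df.drop_duplicates()

# 3. Drop a column judged irrelevant and sparsely filled (hypothetical choice).
clean = clean.drop(columns=["homepage"])

# 4. Fill the remaining missing values with a placeholder (hypothetical value).
clean = clean.fillna({"genres": "Unknown"})

print(clean)
```

Passing a dict to `fillna` limits the fill to the named columns, so numeric columns are not silently overwritten with a text placeholder.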
In the exploratory data analysis, I did the following:
To answer the first question "Does budget relate to vote average of the movie?",
- Use a scatter plot to explore the relationship between budget and average vote.
- Divide the vote averages into three bins, and use a bar chart to visualize the budget in each category.
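
As a rough sketch of the binning step for the first question, the example below splits votes into three bins and summarizes the budget per bin. The column names `budget` and `vote_average`, the equal-width bins, the bin labels, and the use of the mean are assumptions for illustration; the data is made up.

```python
import pandas as pd

# Made-up data; `budget` and `vote_average` are assumed column names.
df = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50, 60],
    "vote_average": [4.0, 5.5, 6.0, 6.5, 7.5, 8.0],
})

# Pearson correlation: a numeric companion to the scatter plot.
corr = df["budget"].corr(df["vote_average"])

# Split the votes into three equal-width bins (an assumed binning scheme).
df["vote_level"] = pd.cut(
    df["vote_average"], bins=3, labels=["low", "medium", "high"]
)

# Mean budget per vote bin: the values a bar chart would display.
mean_budget = df.groupby("vote_level", observed=False)["budget"].mean()

print(f"correlation: {corr:.2f}")
print(mean_budget)
```

With matplotlib installed, `df.plot.scatter(x="budget", y="vote_average")` and `mean_budget.plot.bar()` would draw the two charts. `pd.qcut` is an alternative if bins with equal numbers of movies are preferred over equal-width bins.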
To answer the second question "What percentage of movies made no revenue?",
- Separate the dataset into movies with positive revenue and movies with no revenue.
- Visualize the percentage of movies with no revenue vs. positive revenue using a pie chart.
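
A minimal sketch of the split behind the pie chart is shown below. It assumes that "no revenue" is recorded as `revenue == 0` and that the column is called `revenue`; both are assumptions, and the numbers are made up.

```python
import pandas as pd

# Made-up data; assumes "no revenue" is stored as 0 in a `revenue` column.
df = pd.DataFrame({"revenue": [0, 0, 150, 0, 320, 75, 40, 10]})

# Label each movie by whether it reported any revenue.
group = (df["revenue"] > 0).map(
    {True: "positive revenue", False: "no revenue"}
)

# Share of each group in percent: the values a pie chart would display.
counts = group.value_counts()
percent = counts / counts.sum() * 100

print(percent)
```

With matplotlib installed, `percent.plot.pie(autopct="%.1f%%")` would render the chart. A zero in the data may mean the revenue is unknown rather than truly zero, so the result is better read as "no recorded revenue".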
To answer the third question "Does the average revenue increase over time?",
- Plot average revenue over time and add a regression line to see the trend.
- Plot the revenue of each movie over time to explore the reason for the pattern in the previous graph.
- Based on my assumption, plot the total revenue and the number of movies made each year to confirm the assumption.
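
The yearly aggregates behind these plots could be computed along the following lines. This is a sketch on made-up numbers: the column names `release_year` and `revenue` are assumptions, and a least-squares line from `np.polyfit` stands in for whatever regression line the plotting library draws.

```python
import numpy as np
import pandas as pd

# Made-up data; `release_year` and `revenue` are assumed column names.
df = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2002],
    "revenue": [100, 300, 150, 250, 350, 200, 300, 400, 500],
})

# Mean revenue, total revenue, and number of movies per year.
yearly = df.groupby("release_year")["revenue"].agg(["mean", "sum", "count"])

# Fit a straight line to the yearly means. Years are shifted to start at 0
# to keep the least-squares fit numerically well conditioned.
x = (yearly.index - yearly.index.min()).to_numpy()
slope, intercept = np.polyfit(x, yearly["mean"].to_numpy(), deg=1)
yearly["trend"] = slope * x + intercept

print(yearly)
print(f"slope: {slope:.1f} revenue units per year")
```

Comparing the `sum` and `count` columns side by side shows whether a change in the yearly mean comes from total revenue, from the number of movies released, or from both.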