Using the TMDB movie dataset, which contains information about 10,000 movies, including user ratings and revenue, we investigate the data to answer the questions below and draw some conclusions.


- Which movies made maximum and minimum profits?
- Which director directed the most movies?
- In which year was the average profit highest?
- What are the most common movie genres?
- How do profits relate to release year?
# Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('tmdb-movies.csv')

Data Wrangling

General Properties

# The first row of df
df.head(1)
Data Cleaning

Drop Extraneous Columns

extraneous_columns = ['id', 'imdb_id', 'homepage', 'tagline', 'keywords', 'overview', 'budget_adj',
       'revenue_adj', 'vote_count', 'vote_average', 'production_companies', 'cast']
df.drop(extraneous_columns, axis=1, inplace=True)
   popularity     budget     revenue  original_title         director  runtime                                     genres  release_date  release_year
0   32.985763  150000000  1513528810  Jurassic World  Colin Trevorrow      124  Action|Adventure|Science Fiction|Thriller    2015-06-09          2015

Exploratory Data Analysis

Which movies made maximum and minimum profits?

# Maximum Profit
df['profit'] = df['revenue'] - df['budget']
df_max = df[df['profit'] == df['profit'].max()]
      popularity     budget     revenue original_title       director  \
1386    9.432768  237000000  2781505847         Avatar  James Cameron   

      runtime                                    genres release_date  \
1386      162  Action|Adventure|Fantasy|Science Fiction   2009-12-10   

      release_year      profit  
1386          2009  2544505847  
# Minimum Profit
df_min = df[df['profit'] == df['profit'].min()]
      popularity     budget   revenue     original_title    director  runtime  \
2244     0.25054  425000000  11087569  The Warrior's Way  Sngmoo Lee      100   

                                         genres release_date  release_year  \
2244  Adventure|Fantasy|Action|Western|Thriller   2010-12-02          2010   

         profit
2244 -413912431

"Avatar" made the maximum profit (about 2.54 billion), and "The Warrior's Way" made the minimum profit (a loss of about 414 million).
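The boolean-mask lookup above returns every row that ties for the extreme value. When a single row is enough, `idxmax` and `idxmin` are a common alternative. The sketch below uses made-up titles and numbers (not rows from tmdb-movies.csv) purely to illustrate the calls.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from tmdb-movies.csv.
toy = pd.DataFrame({
    'original_title': ['Film A', 'Film B', 'Film C'],
    'budget': [100, 500, 50],
    'revenue': [900, 120, 60],
})
toy['profit'] = toy['revenue'] - toy['budget']

# idxmax/idxmin return the index label of the first extreme value;
# .loc then pulls the full row for that label.
most_profitable = toy.loc[toy['profit'].idxmax()]
least_profitable = toy.loc[toy['profit'].idxmin()]

print(most_profitable['original_title'], most_profitable['profit'])
print(least_profitable['original_title'], least_profitable['profit'])
```

Note that `idxmax` keeps only the first match, so the mask-based version is the safer choice if ties matter.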

Which director directed the most movies?

0    Woody Allen
dtype: object

Woody Allen directed the most movies in the dataset.
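The cell that produced this result is not shown; the output format suggests `Series.mode()`, but that is an assumption. A minimal sketch of two ways to find the most frequent value, using hypothetical director names:

```python
import pandas as pd

# Hypothetical director values, used only to show the technique.
directors = pd.Series(['Dir A', 'Dir B', 'Dir A', None, 'Dir C', 'Dir A', 'Dir B'])

# value_counts() skips missing values and sorts by count, descending.
director_counts = directors.value_counts()
top_director = director_counts.idxmax()

# mode() returns a Series because several values can tie for first place.
most_frequent = directors.mode()

print(top_director, director_counts.iloc[0])
print(most_frequent.tolist())
```

`value_counts()` has the advantage of also showing how many movies each director made, which makes the size of the lead visible.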

In which year was the average profit highest?

# Select 'profit' before aggregating so non-numeric columns are never averaged
df.groupby('release_year')['profit'].mean().plot(kind='line', figsize=(10, 10), color='orange', legend=True)
plt.ylabel('profit')
plt.title('profits Vs release year')
Text(0.5, 1.0, 'profits Vs release year')


1995    3.615205e+07
1977    3.542111e+07
1992    3.486006e+07
2002    3.289090e+07
2001    3.223294e+07
2003    3.166685e+07
2004    3.134685e+07
1997    3.091145e+07
2015    3.071459e+07
1990    3.049428e+07
1989    3.003873e+07
1993    2.924024e+07
2012    2.821746e+07
2011    2.723087e+07
2007    2.707149e+07
1994    2.644686e+07
2010    2.619791e+07
2009    2.573738e+07
2005    2.527149e+07
1979    2.508738e+07
1999    2.495749e+07
1982    2.494628e+07
1991    2.436366e+07
2008    2.385986e+07
1998    2.377864e+07
2013    2.373097e+07
2014    2.364166e+07
2000    2.312390e+07
1996    2.278054e+07
1983    2.235527e+07
1987    2.202119e+07
2006    2.198420e+07
1973    2.106891e+07
1975    2.048207e+07
1985    1.969492e+07
1988    1.954308e+07
1986    1.899376e+07
1984    1.815536e+07
1980    1.802772e+07
1978    1.785819e+07
1981    1.708352e+07
1967    1.633801e+07
1974    1.599065e+07
1976    1.444374e+07
1972    1.146127e+07
1965    1.108219e+07
1970    1.083150e+07
1961    9.405909e+06
1964    7.178539e+06
1969    6.510580e+06
1971    5.980247e+06
1962    5.026804e+06
1968    4.943435e+06
1960    3.842127e+06
1963    3.355103e+06
1966    5.909106e+05
Name: profit, dtype: float64

The highest average profit per movie was in 1995.
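The ranking above is a per-year mean sorted in descending order; the cell that produced it is not shown, so the following is a sketch under that assumption, run on made-up numbers. It also computes the per-year total, since "most profit" could be read either way and the two can rank years differently.

```python
import pandas as pd

# Made-up numbers for illustration; not taken from the dataset.
toy = pd.DataFrame({
    'release_year': [1995, 1995, 2001, 2001, 2010],
    'profit': [40, 20, 10, 30, -5],
})

# Mean profit per movie for each year, highest first.
yearly_mean_profit = (
    toy.groupby('release_year')['profit']
       .mean()
       .sort_values(ascending=False)
)

# Total profit per year answers a slightly different question.
yearly_total_profit = toy.groupby('release_year')['profit'].sum()

best_year = yearly_mean_profit.idxmax()
print(yearly_mean_profit)
print(best_year)
```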

What are the most common movie genres?

def split_compound_columns(column):
    """Split a column whose values look like this: a|b|c
    Argument: name of the column whose values are separated by '|'
    Returns: Series of all separated values
    """
    # Join every row into one long '|'-separated string (missing values are skipped)
    column = df[column].str.cat(sep='|')
    splitted_column = pd.Series(column.split('|'))
    return splitted_column
genres = split_compound_columns('genres')
genres.value_counts().plot.pie(figsize=(20, 20), legend=True, autopct='%.1f%%')
plt.title('Movies Genres')
Text(0.5, 1.0, 'Movies Genres')


->> The most common movie genres are drama, comedy, thriller and action.
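Counting values in a '|'-separated column can also be done without joining everything into one string. The sketch below is an alternative to the helper above, not the notebook's own method, and it runs on made-up genre strings: it splits each row into a list and uses `explode()` to give every genre its own row.

```python
import pandas as pd

# Made-up genre strings for illustration, including one missing value.
toy_genres = pd.Series(['Action|Adventure', 'Drama', None, 'Drama|Action|Comedy'])

genre_counts = (
    toy_genres.dropna()        # ignore movies with no genre listed
              .str.split('|')  # 'a|b' -> ['a', 'b']
              .explode()       # one row per genre
              .value_counts()
)
print(genre_counts)
```

Because `explode()` preserves the original index, this form also makes it possible to trace each genre back to its movie, which the joined-string approach cannot do.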

#plotting a histogram of the Time Duration of the movies



plt.figure(figsize=(10,7), dpi = 100)

plt.xlabel('Time Duration')
plt.ylabel('Movie Numbers')
plt.title('The Time Duration of the movies')

plt.hist(df['runtime'], rwidth = 1, bins =30)


Most of the movies run for around 100 to 120 minutes.


How do profits relate to release year?

df.plot(x='release_year', y='profit', kind='scatter', color='orange', figsize=(10, 10))
plt.title('Relation between release year and profits')
Text(0.5, 1.0, 'Relation between release year and profits')


->> Release year and profit show only a very weak positive correlation (about 0.03 in the correlation matrix below).
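A scatter plot shows the direction of a relationship, while Pearson's correlation coefficient puts a number on its strength. A minimal sketch on made-up values (not the dataset), where the profits swing up and down so the linear relationship with year is weak:

```python
import pandas as pd

# Made-up values for illustration; profits fluctuate from year to year.
toy = pd.DataFrame({
    'release_year': [2001, 2002, 2003, 2004, 2005],
    'profit': [5, -3, 4, -2, 6],
})

# Series.corr computes Pearson's r by default: values near 0 mean
# little linear relationship, values near +1 or -1 a strong one.
r = toy['release_year'].corr(toy['profit'])
print(round(r, 3))
```

A coefficient this close to zero says that knowing the release year tells us very little about the profit, even though the sign is positive.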

              popularity    budget   revenue    runtime  release_year    profit
popularity      1.000000  0.544858  0.663094   0.140527      0.091347  0.628833
budget          0.544858  1.000000  0.734685   0.193883      0.117470  0.569941
revenue         0.663094  0.734685  1.000000   0.165239      0.058068  0.976165
runtime         0.140527  0.193883  0.165239   1.000000     -0.117172  0.138113
release_year    0.091347  0.117470  0.058068  -0.117172      1.000000  0.032752
profit          0.628833  0.569941  0.976165   0.138113      0.032752  1.000000
fig, ax = plt.subplots(figsize=(15, 10))
# numeric_only=True (pandas 1.5+) skips text columns such as 'genres'
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt="f", ax=ax)
plt.title('correlation matrix for DataFrame')
Text(0.5, 1.0, 'correlation matrix for DataFrame')




Conclusions

1. A high budget does not always lead to high profits: budget and profit correlate at only about 0.57.
2. The most common genres are drama, comedy, thriller and action.
3. The least common genres are TV movie, western, foreign and war.
4. Popular actors and a great director are likely to help a movie's profits, but this analysis did not test that directly (the cast column was dropped).


Limitations

1. Some profits are negative.
2. The raw data is inconsistent, so it had to be cleaned before it could be analyzed.
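To gauge how much the negative profits noted above matter, it helps to count them. A small sketch on made-up numbers (not the dataset) that counts the loss-making movies and their share of all movies:

```python
import pandas as pd

# Made-up numbers for illustration; not rows from tmdb-movies.csv.
toy = pd.DataFrame({
    'original_title': ['Film A', 'Film B', 'Film C', 'Film D'],
    'profit': [800, -380, 10, -25],
})

# A boolean mask sums to a count (True == 1) and averages to a share.
loss_mask = toy['profit'] < 0
n_losses = int(loss_mask.sum())
loss_share = loss_mask.mean()

print(n_losses, loss_share)
print(toy.loc[loss_mask, 'original_title'].tolist())
```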