Diego Duque
The goal of this project is to determine whether the success of a movie can be predicted from public data, and which factors affect that success. The data was scraped from the Box Office Mojo website. I used the Python package BeautifulSoup for the whole scraping process and other packages for the modeling.
In a hypothetical scenario, a new movie studio asked whether it is possible to predict the success of a movie and what the key factors behind it are. I was approached to take on the role of Data Science consultant. The client explicitly asked me to use the Box Office Mojo website and to present the results.
The scraped data includes all the movies from 2010 to 2019. It was collected with BeautifulSoup, a Python package for parsing HTML pages and extracting their contents. In total, 2000 movies were scraped, and 1151 remained after cleaning the data.
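To illustrate the kind of parsing involved, here is a minimal sketch that extracts rows from an HTML table with BeautifulSoup. The HTML fragment, the column layout, and the helper names `parse_number` and `parse_table` are assumptions made for this example; the real Box Office Mojo markup may differ, and this is not the project's actual scraper.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a scraped page.
SAMPLE_HTML = """
<table>
  <tr><th>Release</th><th>Gross</th><th>Theaters</th></tr>
  <tr><td>Movie A</td><td>$1,234,567</td><td>3,200</td></tr>
  <tr><td>Movie B</td><td>$98,765</td><td>-</td></tr>
</table>
"""


def parse_number(text):
    """Convert a string such as '$1,234,567' to an int; return None if not numeric."""
    digits = text.replace("$", "").replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


def parse_table(html):
    """Return one dict per data row of the table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 3:  # the header row has <th> cells only, so it is skipped
            continue
        title, gross, theaters = cells
        rows.append({
            "title": title,
            "gross": parse_number(gross),
            "theaters": parse_number(theaters),
        })
    return rows


movies = parse_table(SAMPLE_HTML)
for movie in movies:
    print(movie)
```

Missing values (shown here as `-`) come back as `None`, which makes them easy to drop or impute during cleaning.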
Scraping: Python tools running in the open-source Jupyter Notebook were used to clean, aggregate, and visualize the data.
Modeling: Linear Regression (OLS) was fitted on the numerical features Budget, Widest Release, and Runtime (in minutes): R^2 = 0.587.
Polynomial Regression was then used to improve on this baseline model: R^2 = 0.756.
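The comparison between the linear baseline and a polynomial model can be sketched with scikit-learn as follows. The data here is synthetic, and the polynomial degree (2), the train/test split, and all variable names are assumptions for illustration; they are not the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for Budget, Widest Release, and Runtime.
X = np.column_stack([
    rng.uniform(1, 200, n),     # budget
    rng.uniform(500, 4500, n),  # widest release (number of theaters)
    rng.uniform(80, 180, n),    # runtime in minutes
])
# Synthetic target with an interaction term that a purely linear model cannot capture.
y = 0.5 * X[:, 0] + 0.01 * X[:, 1] + 0.0005 * X[:, 0] * X[:, 1] + rng.normal(0, 10, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Baseline: ordinary least squares on the raw features.
linear = LinearRegression().fit(X_train, y_train)

# Polynomial regression: scale, expand to degree-2 terms, then fit OLS.
poly = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X_train, y_train)

r2_linear = linear.score(X_test, y_test)  # score() returns R^2 for regressors
r2_poly = poly.score(X_test, y_test)
print(f"linear R^2 = {r2_linear:.3f}, polynomial R^2 = {r2_poly:.3f}")
```

Scoring on a held-out split keeps the comparison honest, since a polynomial expansion always fits the training data at least as well as the linear model.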
Later, feature engineering was used to add the qualitative variables Season, Distributor, MPAA Rating, and Genres. OLS with these qualitative features reached R^2 = 0.882.
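One common way to bring qualitative variables into a linear model is one-hot encoding. The sketch below uses pandas for this; the toy rows, the column names, and the month-to-season mapping in `month_to_season` are assumptions for illustration, since the project's exact definitions are not given here.

```python
import pandas as pd

# Toy rows standing in for the cleaned movie data.
df = pd.DataFrame({
    "budget": [100.0, 20.0, 50.0],
    "release_date": pd.to_datetime(["2015-07-10", "2016-12-20", "2018-03-02"]),
    "distributor": ["A", "B", "A"],
    "mpaa_rating": ["PG-13", "R", "PG"],
})


def month_to_season(month):
    """Map a month number to a season (assumed meteorological definition)."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"


# Derive the Season feature from the release date.
df["season"] = df["release_date"].dt.month.map(month_to_season)

# One-hot encode the qualitative columns; drop_first avoids perfectly
# collinear dummy columns when the model also has an intercept.
features = pd.get_dummies(
    df.drop(columns=["release_date"]),
    columns=["season", "distributor", "mpaa_rating"],
    drop_first=True,
)
print(features.columns.tolist())
```

A movie can belong to several genres at once, so Genres would likely need one indicator column per genre rather than a single categorical column.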
With these engineered features, Ridge CV was used: R^2 = 0.896
LASSO CV was also used: R^2 = 0.894
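Both regularized models can be sketched with scikit-learn's `RidgeCV` and `LassoCV`, which select the regularization strength by cross-validation. The synthetic data, the alpha grid, and the 5-fold setting below are assumptions for illustration, not the project's configuration.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 400, 20

# Synthetic design matrix in which only the first five features matter.
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]
y = X @ true_coef + rng.normal(0, 1.0, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

alphas = np.logspace(-3, 2, 30)  # candidate regularization strengths

# Standardize first so the penalty treats all features on the same scale.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas)).fit(X_train, y_train)
lasso = make_pipeline(
    StandardScaler(), LassoCV(alphas=alphas, cv=5, random_state=1)
).fit(X_train, y_train)

r2_ridge = ridge.score(X_test, y_test)
r2_lasso = lasso.score(X_test, y_test)
best_ridge_alpha = ridge.named_steps["ridgecv"].alpha_
best_lasso_alpha = lasso.named_steps["lassocv"].alpha_
n_zeroed = int(np.sum(lasso.named_steps["lassocv"].coef_ == 0))

print(f"Ridge R^2 = {r2_ridge:.3f} (alpha = {best_ridge_alpha:.4g})")
print(f"LASSO R^2 = {r2_lasso:.3f} (alpha = {best_lasso_alpha:.4g}, "
      f"{n_zeroed} coefficients set to zero)")
```

Ridge shrinks all coefficients toward zero, while LASSO can set some of them exactly to zero, which is useful when many one-hot columns carry little signal.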
- BeautifulSoup for scraping.
- Scikit-learn (sklearn) for modeling.
- Pandas, Datetime, and NumPy for data manipulation.
- Matplotlib, Seaborn, NumPy, and StatsModels for plotting.
The slides used during the presentation can be accessed through the link provided.