Web Scraping and EDA

The goal of this notebook is to build a dataframe of movie information scraped from Box Office Mojo. The scraped dataset is used for exploratory data analysis, feature engineering, and linear regression modeling. The resulting dataframe is saved as a CSV file for easier access later, and also in the movies.py Python file in this repository for use in future projects. The code in the web scraping notebooks extracts data on roughly 1,600 domestic releases from 2018 and 2019, with fields such as movie title, total domestic gross revenue, runtime, rating, and budget.
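As a rough illustration of the scrape-then-save flow described above, here is a minimal sketch that uses only the Python standard library. It is not the code from the notebooks: the sample HTML, the column names, and the helpers `TableParser`, `parse_money`, `scrape_table`, and `to_csv` are all hypothetical, and the markup is a stand-in rather than Box Office Mojo's actual page structure.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup standing in for a scraped page.
# This is NOT Box Office Mojo's real HTML; titles and figures are made up.
SAMPLE_HTML = """
<table>
  <tr><th>title</th><th>domestic_gross</th><th>runtime_min</th><th>rating</th><th>budget</th></tr>
  <tr><td>Example Movie A</td><td>$1,234,567</td><td>101</td><td>PG-13</td><td>$5,000,000</td></tr>
  <tr><td>Example Movie B</td><td>$89,000</td><td>95</td><td>R</td><td></td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_money(text):
    """Turn '$1,234,567' into 1234567; return None for an empty cell."""
    digits = text.replace("$", "").replace(",", "").strip()
    return int(digits) if digits else None


def scrape_table(html):
    """Parse the table into a list of dicts, one per movie."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    records = [dict(zip(header, row)) for row in body]
    for rec in records:
        rec["domestic_gross"] = parse_money(rec["domestic_gross"])
        rec["budget"] = parse_money(rec["budget"])
        rec["runtime_min"] = int(rec["runtime_min"])
    return records


def to_csv(records):
    """Serialize the records to CSV text (missing values become empty fields)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


movies = scrape_table(SAMPLE_HTML)
csv_text = to_csv(movies)
print(csv_text)
```

In a notebook, the list of dicts could be passed to `pandas.DataFrame(movies)` and written out with `DataFrame.to_csv`. The sketch stays with the standard library only so that it runs without extra dependencies.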

File structure

Web scraping is done in two notebooks labeled "Webscraping", while exploratory data analysis and linear regression modeling are done in "project2_EDA + LR".

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
