839 Project Stage 2: IMDB and Movie Numbers Web Page Movie Extraction

Introduction

This project follows the guidelines for AnHai's CS 839 Data Analysis course for project stage 2, and contains scripts and data for two movie lists from IMDB.com and the-numbers.com. We have the scripts arranged in a code folder and the json and csv formatted data in the data folder.

Running the Extractors

Scraping the IMDB Website

Running the scraper to scrape the IMDB website is shown below. If you are running the scraper for the first time, then make sure to use the "--generate" optional argument. This optional argument tells the scraper to actually go to the web url to pull information for creating the .csv file, and generates a corresponding intermediate json file. If the json file has already been generated, then the optional --generate option can be dropped. However, this script assumes that the "imdb_movies.json" file has already been generated in that case, and that the file is in the same folder as the scraper script. Also note that if you already have a generated "imdb_movies.json" file generated and want to generate another file, you need to remove the original file first before generating a new file. If you don't, the parser will throw json parse errors.

cd CODE/
python scrapeIMDB.py --generate

Scraping the Movie Numbers Website

Running this scraper is the same as running the previous scraper.

cd CODE/
python scrapeMovieNumbers.py --generate

Built With

Python 3.4.3
Numpy 1.13.1 - Numerical matrix lib
Pandas 0.21.0 - Data manipulation lib
Scrapy 1.5.0 - Web scraping framework

Authors

Daniel Griffin - dcompgriff
Yudhister Satija
Mitali Rawat

dcompgriff/cs839_Video_Database