- Target: Scrape information on 2000 movies from Rotten Tomatoes
- Please download the Cache File first!
- Download the cache file (otherwise you will have to wait around 1 hour for the data to be scraped)
- Install the dependencies: `pip3 install -r requirements.txt`
- Create a database named `chuyao_507finalproject`
- Create a `plotlyconfig.py` and fill it in with your own Plotly config (you can refer to `plotlyconfig_example.py`).
- Run `SI507F17_finalproject.py`
- I suggest changing line 184 of `SI507F17_finalproject.py` from `movie_list = return_movie_list(2000)` to `movie_list = return_movie_list(5)` before running the test file. This will save you a lot of time.
- Run `SI507F17_finalproject_test.py`
- Cache the URLs of 2000 movies from this page into `url.csv`. Since the page is rendered with JavaScript, I used PhantomJS to fake-click the "show more" button.
- Use `url.csv` to scrape the 2000 movie pages. Cache the JSON data into `cache_contents.json`.
- Create a class, `Movie()`. A Movie instance includes the following attributes:
  - Movie Name
  - Genre
  - Director
  - Date in Theatre
  - Box Office
  - Tomato Meter
  - Tomato Meter Number
  - Audience Score
  - Audience Number
- Store the information for the 2000 movies in `data.csv`
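The caching, class, and CSV steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names, the `get_page` and `write_movies_csv` helpers, and the `fetch` callback are all assumptions made for the example, and the sample values are placeholders.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Movie:
    """One scraped movie. Field names are illustrative assumptions."""
    name: str
    genre: str
    director: str
    date_in_theatre: str
    box_office: str
    tomato_meter: int
    tomato_meter_num: int
    audience_score: int
    audience_num: int


def get_page(url, cache, fetch):
    """Return the cached data for `url`, calling `fetch(url)` only on a cache miss."""
    if url not in cache:
        cache[url] = fetch(url)
    return cache[url]


def write_movies_csv(movies, out):
    """Write a header row plus one row per Movie to the file-like object `out`."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(Movie)])
    writer.writeheader()
    for movie in movies:
        writer.writerow(asdict(movie))


if __name__ == "__main__":
    # Placeholder values, not real scraped data.
    sample = Movie("Example Movie", "Drama", "Jane Doe", "2017-01-01",
                   "$1M", 90, 100, 85, 2000)
    buffer = io.StringIO()
    write_movies_csv([sample], buffer)
    print(buffer.getvalue())
```

In a real run, `cache` would presumably be loaded from and saved back to `cache_contents.json` with the `json` module, and `out` would be an open handle on `data.csv`.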
- The database includes 2 tables:
  - `basic_info_of_movie`
    - name (PRIMARY KEY)
    - genre
    - director
    - time_in_theatre
    - boxoffice
    - tomato_meter
  - The second table
    - movie_id (PRIMARY KEY)
    - name (FOREIGN KEY)
    - tomato_meter
    - tomato_num
    - audience_score
    - audience_num
- Insert the 2000 records into the two tables
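As a rough sketch of the two-table layout, the snippet below builds it in an in-memory SQLite database. Several things here are assumptions for illustration: the project may use a different database engine, the second table's name (`movie_ratings`) is hypothetical, and the column types are guesses.

```python
import sqlite3

# Column types and the name `movie_ratings` are assumptions, not from the project.
SCHEMA = """
CREATE TABLE basic_info_of_movie (
    name            TEXT PRIMARY KEY,
    genre           TEXT,
    director        TEXT,
    time_in_theatre TEXT,
    boxoffice       TEXT,
    tomato_meter    INTEGER
);
CREATE TABLE movie_ratings (
    movie_id       INTEGER PRIMARY KEY,
    name           TEXT REFERENCES basic_info_of_movie(name),
    tomato_meter   INTEGER,
    tomato_num     INTEGER,
    audience_score INTEGER,
    audience_num   INTEGER
);
"""


def build_db(movies):
    """Create both tables in memory and insert one row per movie dict into each."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    with conn:  # commit both inserts together
        conn.executemany(
            "INSERT INTO basic_info_of_movie VALUES "
            "(:name, :genre, :director, :time_in_theatre, :boxoffice, :tomato_meter)",
            movies,
        )
        conn.executemany(
            "INSERT INTO movie_ratings "
            "(name, tomato_meter, tomato_num, audience_score, audience_num) VALUES "
            "(:name, :tomato_meter, :tomato_num, :audience_score, :audience_num)",
            movies,
        )
    return conn
```

The parent table is filled first so that every `name` in the second table already exists when the foreign key is checked.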
Use Plotly to visualize the relationship between the Tomato Meter and the Audience Score as a scatter plot, and the relationship between the Tomato Meter and genre. After running the program, you get two HTML files showing these plots.
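One way the two figures could be described is as plain dictionaries in Plotly's figure format, sketched below. The function names and the dictionary keys on each movie are assumptions, and showing the genre relation as one box plot per genre is a guess at a reasonable chart type, not necessarily what the project draws.

```python
from collections import defaultdict


def scatter_figure(movies):
    """Build a Plotly-style figure dict: Tomato Meter (x) against Audience Score (y)."""
    return {
        "data": [{
            "type": "scatter",
            "mode": "markers",
            "x": [m["tomato_meter"] for m in movies],
            "y": [m["audience_score"] for m in movies],
            "text": [m["name"] for m in movies],  # hover label
        }],
        "layout": {
            "title": {"text": "Tomato Meter vs. Audience Score"},
            "xaxis": {"title": {"text": "Tomato Meter"}},
            "yaxis": {"title": {"text": "Audience Score"}},
        },
    }


def genre_figure(movies):
    """Build a Plotly-style figure dict with one box trace of Tomato Meter per genre."""
    scores_by_genre = defaultdict(list)
    for m in movies:
        scores_by_genre[m["genre"]].append(m["tomato_meter"])
    return {
        "data": [
            {"type": "box", "name": genre, "y": scores}
            for genre, scores in sorted(scores_by_genre.items())
        ],
        "layout": {
            "title": {"text": "Tomato Meter by Genre"},
            "yaxis": {"title": {"text": "Tomato Meter"}},
        },
    }
```

Assuming the `plotly` package is installed, passing either dict to `plotly.offline.plot(fig, filename="scatter.html")` should write a standalone HTML file; the filename here is only an example.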