eatsurevan_grab

Towards an implementation of EatSure.ca with Vancouver Coastal Health data

EatSure Van Data Grabber
Tim Smith, Naoya Makino
December 4, 2010

Hacked together real fast! 'Cause that's how we roll. Here are some scripts to scrape restaurant inspection data from the unfriendly Vancouver Coastal Health restaurant inspection website. (http://www.foodinspectionweb.vcha.ca/)

See it more or less in action at: http://www.google.com/fusiontables/DataSource?snapid=111755 -- I recommend the map view, myself. Note that inconsistent address information causes some of the points to appear in unlikely locations.

Things that you need:
1. Python 2.x
2. BeautifulSoup
3. A sense of adventure

Run them in this order:
1. python fetchandparselist.py
    Fetches and parses the entire list of restaurants from VCH. Creates
    restaurants.html. Outputs restos.tab, a tab-separated text file
    containing restaurant name, address, and date of last inspection; the
    last column is the restaurant's GUID. (A sketch of this step appears
    after step 3, below.)
2. mkdir restos; python fetchrestos.py
    Fetches the restaurant info pages for each individual restaurant.
    For bonus points, parallelize this yourself; it takes a while.
3. python parserestos.py
    Creates an index of inspections that have occurred, the date, and the type
    of inspection. Creates inspections.repr.
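
For reference, here's a rough sketch of what step 1 boils down to. It
assumes Python 2 with BeautifulSoup 3 (per the requirements above); the
listing URL, the table layout, and the cell order are guesses, since the
real VCH markup isn't reproduced here:

    # Sketch of fetchandparselist.py. The URL path and the <tr>/<td>
    # structure are assumptions, not taken from the real site.
    import codecs
    import urllib2
    from BeautifulSoup import BeautifulSoup

    LIST_URL = 'http://www.foodinspectionweb.vcha.ca/'  # assumed listing page

    html = urllib2.urlopen(LIST_URL).read()
    open('restaurants.html', 'w').write(html)  # keep the raw page around

    def cell_text(tag):
        # Flatten a tag to its visible text (works on BeautifulSoup 3).
        return ''.join(tag.findAll(text=True)).strip()

    soup = BeautifulSoup(html)
    out = codecs.open('restos.tab', 'w', 'utf-8')
    for row in soup.findAll('tr'):
        cells = row.findAll('td')
        if len(cells) < 4:
            continue  # header or decoration row
        name, address, inspected, guid = [cell_text(c) for c in cells[:4]]
        out.write('\t'.join([name, address, inspected, guid]) + '\n')
    out.close()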

To fetch each individual inspection report (our current "score" does not depend
on these, but probably should):
4. mkdir inspections; python fetchinspections.py start finish
    start and finish are indices into the list of restaurants, so you can
    fetch a sliver of the data set at a time. This is useful for parallelizing
    the fetch. For example, you can start six simultaneous instances of
    fetchinspections... [0 1000), [1000 2000), [2000 3000), [3000 4000),
    [4000 5000), and [5000 6000). Python will not complain if you overshoot
    the end of the data set on the last one. There are slightly fewer than 6000
    restaurants in the database. You will see some 500 errors, which happen
    because the scraper (erroneously) tries to grab a Permit object instead
    when there are no inspections to grab. These should not be cause for
    concern. (A sketch of this step appears below.)
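
Here's a sketch of how the start/finish slicing in fetchinspections.py
might look. The report URL pattern and the shape of inspections.repr (a
repr()'d list read back with eval) are assumptions; the point is the
slicing behaviour, since slicing past the end of a list is harmless in
Python:

    # Sketch of fetchinspections.py. REPORT_URL and the layout of
    # inspections.repr are assumptions.
    import sys
    import urllib2

    REPORT_URL = 'http://www.foodinspectionweb.vcha.ca/report?id=%s'  # assumed

    start, finish = int(sys.argv[1]), int(sys.argv[2])

    # Assumed: parserestos.py wrote a repr()'d list of (guid, date, kind).
    inspections = eval(open('inspections.repr').read())

    # inspections[5000:6000] on a ~5,900-item list simply yields the last
    # ~900 items, which is why overshooting the end is fine.
    for i, (guid, date, kind) in enumerate(inspections[start:finish]):
        try:
            html = urllib2.urlopen(REPORT_URL % guid).read()
        except urllib2.HTTPError:
            continue  # the 500s mentioned above; safe to skip
        open('inspections/%d.html' % (start + i), 'w').write(html)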

To generate the score reports we have now:
1. python score.py
    Generates scores.pickle, which contains a dictionary linking the restaurant
    GUID to the number of re-inspections noted recently on the VCH website
    (the "score").
2. python mkcsv.py
    Generates a CSV file suitable for upload to Google Fusion Tables. Creates
    (imaginatively) output.csv. (A sketch of this step appears below.)
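
And a sketch of what mkcsv.py might do with scores.pickle and restos.tab.
The column names are assumptions; Fusion Tables just needs a header row
plus an address column it can geocode for the map view:

    # Sketch of mkcsv.py. Column names are assumptions; restos.tab columns
    # are name, address, last inspection date, GUID (per step 1 above).
    import csv
    import pickle

    scores = pickle.load(open('scores.pickle', 'rb'))  # {guid: score}

    writer = csv.writer(open('output.csv', 'wb'))
    writer.writerow(['Name', 'Address', 'Score'])
    for line in open('restos.tab'):
        name, address, inspected, guid = line.rstrip('\n').split('\t')
        writer.writerow([name, address, scores.get(guid, 0)])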

And that's where we are.