pr_scrapers

Scrapers to collect press releases. In Python.

This project is a stand-alone sandbox for easy development of press release scrapers for incorporation into churnalism.com. It's designed to require minimal setup...

Requirements

lxml
BeautifulSoup (for UnicodeDammit)

Details

Just copy an existing scraper (eg onepoll.py) and start hacking about!

base.py is a mockup of the churnalism.com scraper interface. Rather than working against a database it just dumps scraped press releases out to stdout. It provides the BaseScraper interface to derive scrapers from. It also installs a caching handler for urllib2 which creates a ".cache" directory to stores downloaded files. This makes repeated test runs during development a lot quicker. Just delete the ".cache" dir to clear the cache.

To try out your scraper, add something like this:

if __name__ == "__main__":
    scraper = Scraper()
    scraper.run()

Then you can just run it directly, eg:

$ python <your_scraper>

mediastandardstrust/pr_scrapers

pr_scrapers

Requirements

Details