This project collects data from various City of Dayton area websites to power our open data catalog.
First, install Vagrant for your host operating system. Vagrant runs virtual machines inside VirtualBox on your host OS. We use a common base image and provision it with Vagrant (Puppet) so that all developers have the same runtime environment.
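The repository ships its own Vagrantfile, so you don't need to write one yourself. For orientation, a minimal Vagrantfile with Puppet provisioning looks roughly like the sketch below; the box name and manifest paths here are illustrative examples, not the project's actual values.

```ruby
# Illustrative only -- the repository's own Vagrantfile is authoritative.
Vagrant.configure("2") do |config|
  # A common base image shared by all developers (box name is an example).
  config.vm.box = "ubuntu/trusty64"

  # Provision the VM with Puppet so every environment is configured identically.
  config.vm.provision "puppet" do |puppet|
    puppet.manifests_path = "puppet/manifests"   # example path
    puppet.manifest_file  = "site.pp"            # example manifest
  end
end
```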
To start the virtual machine, type:
vagrant up
Wait for installation and setup to complete (this takes several minutes). Then SSH into the virtual machine, activate the Python virtual environment, and change to the synced folder:
vagrant ssh
source ~/virtenv/bin/activate
cd /vagrant
Spiders crawl a website, extracting data from the web pages. Each spider is customized for a specific website or set of websites.
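Conceptually, a spider fetches pages and pulls structured records out of the HTML. The spiders in this project do that with Scrapy's own selector machinery; as a rough stdlib-only illustration of the extraction step (the HTML fragment and field here are made up, not taken from any real City of Dayton page):

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text of every <h2> element -- a toy stand-in for
    the CSS/XPath extraction a real Scrapy spider performs."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles.append(data.strip())

# Example page fragment (invented for illustration).
page = "<html><body><h2>Parks</h2><p>...</p><h2>Permits</h2></body></html>"
parser = TitleExtractor()
parser.feed(page)
print(parser.titles)  # -> ['Parks', 'Permits']
```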
You can view the available spiders by running:
scrapy list
You can start a crawl with:
scrapy crawl spider_name
Scraped data is output as JSON to 'feed.json'.
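Once a crawl finishes, feed.json can be inspected with any JSON tooling. A small sketch of loading it in Python follows; the sample record written here is purely hypothetical, since the actual fields depend on which spider produced the feed.

```python
import json

# Write a tiny sample feed so this snippet is self-contained; a real
# crawl produces feed.json with spider-specific fields instead.
sample = [{"name": "Riverscape", "url": "http://example.org/riverscape"}]
with open("feed.json", "w") as f:
    json.dump(sample, f)

# Load the scraped records and do a quick sanity check.
with open("feed.json") as f:
    records = json.load(f)

print(len(records))        # number of scraped items -> 1
print(records[0]["name"])  # -> Riverscape
```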