This project is part of Catalog Politic, an online platform that centralizes publicly available information about holders of public office in Romania.
Here we tackle the problem of gathering the data available on the internet in a semi-structured form. We use Python and Scrapy to crawl and parse the web, after which we dump the data into a MongoDB database. Please keep in mind that this project is at a very early stage, so some of the features are still experimental. The architecture is also subject to change, since we want to stay in sync as much as possible with our other similar projects, such as CZL Scrape, part of Ce Zice Legea.
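Roughly speaking, each spider yields items that Scrapy runs through item pipelines, and a pipeline is what hands the data off to MongoDB. The sketch below shows the standard shape of such a pipeline (assuming pymongo); the class name and the `MONGO_URI` / `MONGO_DATABASE` settings keys are illustrative, not necessarily what this project uses:

```python
import pymongo


class MongoPipeline:
    """Illustrative pipeline that writes every scraped item to MongoDB."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Connection details come from the Scrapy settings (hypothetical keys).
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "catalog"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One collection per spider keeps the dumps separated.
        self.db[spider.name].insert_one(dict(item))
        return item
```

A pipeline like this only runs if it is enabled in the project's `ITEM_PIPELINES` setting.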
The only requirements for running the application are Python 3.6.1 and Scrapy 1.3.3. The same requirements apply for development, unless you want to generate test cases by snapshotting the website, in which case Selenium with PhantomJS is also required.
If you use pip, we provide a requirements.txt and a requirements_dev.txt file for you.
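As a purely illustrative sketch of what the two files cover (the pinned versions and exact contents in the repository may differ), based on the requirements listed above:

```
# requirements.txt — runtime dependencies
Scrapy==1.3.3

# requirements_dev.txt — everything above plus test-generation tooling
-r requirements.txt
selenium
```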
We recommend you use pip and virtualenv to set up your environment.
- macOS (tested on 10.12)
- Install Homebrew.
brew install python3
pip3 install virtualenv
- Ubuntu (tested on 16.04 LTS)
sudo apt-get install python3-pip libssl-dev
pip3 install virtualenv
- Windows (tested on 8.1)
pip install virtualenv
- Follow steps 1-5 of the common instructions below (on Windows, activate the environment with cdc_env\Scripts\activate instead of source), then:
pip install pypiwin32
The following steps are common to all platforms:
git clone https://github.com/code4romania/catalog-data-chambers
cd catalog-data-chambers
virtualenv -p python3 cdc_env
source cdc_env/bin/activate
pip install -r requirements.txt
scrapy crawl cdep -a legs=2016 -o 2016.json
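The -o flag tells Scrapy to serialize the scraped items; with a .json extension it writes a single JSON array, so you can inspect the result with a few lines of Python (the file name is the one from the command above):

```python
import json

# 2016.json is the output file produced by the crawl command above.
with open("2016.json", encoding="utf-8") as f:
    items = json.load(f)

print(len(items), "items crawled")
print(items[0] if items else "no items")
```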
These instructions are verified to work on the listed systems, but they do not have to be executed exactly as given; feel free to customize your setup to suit your needs.
A legislative session lasts 4 years and is represented by the year in which it began; for example, the 2008-2012 session is represented by the year 2008. You can pass one or more session years to the legs argument to crawl specific sessions (defaults to 2016). For example, to crawl initiatives from the 2004 and 2008 sessions, run scrapy crawl cdep -a legs='2004 2008'.
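Scrapy passes -a arguments to the spider's constructor as string keyword arguments, so a space-separated legs value is typically split inside the spider along these lines (a sketch only; the actual spider's attribute names and defaults may differ):

```python
import scrapy


class CdepSpider(scrapy.Spider):
    name = "cdep"

    def __init__(self, legs="2016", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a legs='2004 2008' arrives as the single string "2004 2008";
        # split it into individual session start years.
        self.legs = [int(year) for year in legs.split()]
        # Request generation and parsing logic omitted in this sketch.
```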
The cdep_voting spider crawls voting data and accepts the following arguments:
- after: crawl all years starting from this year. If this argument is provided, all the others are ignored.
- year: year to crawl. If no month and day are specified, it will crawl every day of every month (for which it finds activity).
- month: month to crawl. If no day is specified, it will crawl every day of that month (for which it finds activity).
- day: day to crawl.
scrapy crawl cdep_voting -a year=2017 -a month=6 -a day=20
scrapy crawl cdep_voting -a after=2006
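These arguments also arrive as string constructor keyword arguments; the precedence described above (after overrides year/month/day) could be handled roughly like this inside the spider (illustrative only, not the project's actual code):

```python
import scrapy


class CdepVotingSpider(scrapy.Spider):
    name = "cdep_voting"

    def __init__(self, after=None, year=None, month=None, day=None,
                 *args, **kwargs):
        super().__init__(*args, **kwargs)
        if after is not None:
            # `after` wins: crawl everything from this year onward and
            # ignore year/month/day entirely.
            self.after = int(after)
            self.year = self.month = self.day = None
        else:
            self.after = None
            self.year = int(year) if year else None
            self.month = int(month) if month else None
            self.day = int(day) if day else None
        # Request generation and parsing logic omitted in this sketch.
```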
This command will crawl the Romanian members of the European Parliament. The information crawled covers the members in office at the time the command is run; there are currently no input parameters for crawling information about past members.
scrapy crawl euro -o 2016_eu.json
This command will crawl and dump the information about the political parties of the CDEP members. The crawl command is similar to the cdep command; it requires legs as an input parameter.
scrapy crawl circ -a legs=2016 -o 2016_circ.json
If you want to make sure the spiders still yield the same values and requests on our set of input data, you can run scrapy test.
Sometimes you find edge cases such as weird characters, unexpected elements, and so on. After you fix the problem, you want to ensure that the spider does not fail on these edge cases again in the future. For this we developed an automatic test generation tool. You can generate a test by running scrapy gentest <url> <spider> <method>. The URL tells the generator which page to save a local HTML snapshot and spider results for. The other arguments specify the spider and the spider method that should parse the response; its output is saved as the expected test answer. Frozen test responses and results are saved in the frozen directory. When you generate a test, the generator also saves a .png screenshot of the website, so you can reference it later should the need arise.
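Conceptually, a frozen test boils down to replaying the saved HTML through the chosen spider method and comparing the output against the stored results. The standard Scrapy way to drive a spider method offline looks roughly like this (a sketch of the idea, not the project's actual test runner; the import path and file names are hypothetical):

```python
from scrapy.http import HtmlResponse, Request

from cdc.spiders.cdep import CdepSpider  # hypothetical import path


def results_for_snapshot(url, html_path):
    """Feed a frozen HTML snapshot to the spider method that parsed it."""
    with open(html_path, "rb") as f:
        body = f.read()
    # Build an offline response so no network access is needed.
    response = HtmlResponse(url=url, body=body, encoding="utf-8",
                            request=Request(url=url))
    spider = CdepSpider()
    return list(spider.parse(response))
```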
Right now there is no proper way to manage these tests, but we are working on it. However, you can update a test by regenerating it with the same URL, spider, and method, or delete it directly from the frozen directory.
Check out our issues page. We regularly want to gather more data and make other changes, so your help is welcome! If you find an issue you would like to tackle, just post a short message about how you plan to solve it (if the task is small enough, this might not be needed). If you have any problems with the setup or with understanding our architecture, don't hesitate to contact us!