/nosfinanceslocales_scraper

Scrape financial data of cities, EPCI, departments and regions


NosFinancesLocales scraper

This project aims to scrape the financial data of cities ("communes"), EPCI (groups of cities, cf. Wikipedia), departments and regions from the website http://www.collectivites-locales.gouv.fr/.

This project uses scrapy to crawl and scrape data.

All the data scraped for the regions is committed as an example here:

Usage

To crawl the data of a given zone type (city, epci, department or region) for a given fiscal year YYYY, run in the root dir:

scrapy crawl localfinance -o scraped_data_dir/zonetype_YYYY.json -t jsonlines -a year=YYYY -a zone_type=zonetype
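With `-t jsonlines`, the output file holds one JSON object per line, so it can be read back line by line. The following is a minimal sketch assuming only that format; the helper name `read_jsonlines` and the field names in the sample are hypothetical, not taken from the spider.

```python
import io
import json


def read_jsonlines(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample: real records carry whatever fields the spider emits.
sample = io.StringIO('{"name": "A", "year": 2014}\n\n{"name": "B", "year": 2014}\n')
records = list(read_jsonlines(sample))
print(len(records), "records read")
```

For a real run, replace `sample` with an open file such as `open("scraped_data_dir/region_2014.json")`.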

To scrape data for all available years for a given zone type:

source bin/crawl_all_years.sh zonetype
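Crawling several years amounts to repeating the single-year command once per year. The sketch below only builds those command lines; it is not the content of `bin/crawl_all_years.sh`, and the year range and the function names are assumptions chosen for illustration.

```python
import shlex


def crawl_command(zone_type, year, out_dir="scraped_data_dir"):
    """Return the single-year scrapy command from the usage section as an argument list."""
    return [
        "scrapy", "crawl", "localfinance",
        "-o", f"{out_dir}/{zone_type}_{year}.json",
        "-t", "jsonlines",
        "-a", f"year={year}",
        "-a", f"zone_type={zone_type}",
    ]


def crawl_commands(zone_type, years):
    """Build one command per fiscal year."""
    return [crawl_command(zone_type, year) for year in years]


# Hypothetical year range: the years actually available may differ.
commands = crawl_commands("region", range(2012, 2015))
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the root dir to actually start the crawl.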

To generate a CSV file with all the data for a given zonetype and with French headers, run:

python bin/make_csv.py zonetype

This command generates the file nosdonnees/zonetype_all.csv, which you can then upload to the nosdonnees.fr website.
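The conversion from scraped JSON lines to a CSV file with French headers can be sketched as follows. This is an illustration, not the code of `bin/make_csv.py`: the mapping `FRENCH_HEADERS`, the function name and the sample record are all hypothetical.

```python
import csv
import io
import json

# Hypothetical mapping from scraped variable names to French column headers.
FRENCH_HEADERS = {"name": "Nom", "year": "Exercice", "population": "Population"}


def jsonlines_to_csv(lines, out, headers=FRENCH_HEADERS):
    """Write JSON Lines records to `out` as CSV, using translated headers."""
    # Fields missing from the mapping are ignored; missing values stay empty.
    writer = csv.DictWriter(out, fieldnames=list(headers), extrasaction="ignore")
    # Write the translated header row instead of the raw variable names.
    writer.writerow(headers)
    for line in lines:
        if line.strip():
            writer.writerow(json.loads(line))


# Hypothetical record, with one field that has no column in the mapping.
lines = ['{"name": "Exemple", "year": 2014, "population": 100, "extra": 1}']
out = io.StringIO()
jsonlines_to_csv(lines, out)
csv_text = out.getvalue()
print(csv_text)
```

Keeping the header translation in a single dictionary means the column order and the French labels are defined in one place.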

The last uploaded dataset is currently available here.

Requirements

See requirements.txt file.

Tests

Run all tests

unit2 discover

Run one test

python test/test_commune_parsing.py Commune2009ParsingTestCase

Download an HTML file to add a new test

Here are some examples that download the HTML pages for a region, a department, an EPCI and a city for the year 2014:

curl -X POST -d "REG=025&EXERCICE=2014" http://alize2.finances.gouv.fr/regions/detail.php > test/data/region_2014_account.html

curl -X POST -d "DEP=002&EXERCICE=2014" http://alize2.finances.gouv.fr/departements/detail.php > test/data/department_2014_account.html

curl -X POST -d "NOMDEP=ALLIER&ICOM=008&DEP=003&TYPE=BPS&PARAM=0&EXERCICE=2014&SIREN=240300418" http://alize2.finances.gouv.fr/communes/eneuro/detail_gfp.php > test/data/epci_2014_account.html

curl -X POST -d "ICOM=234&DEP=045&TYPE=BPS&PARAM=0&EXERCICE=2014" http://alize2.finances.gouv.fr/communes/eneuro/detail.php > test/data/commune_2014_account.html
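The same POST requests can be prepared from Python with the standard library. The sketch below copies the endpoints and form fields from the curl examples; the names `EXAMPLES` and `build_request` are hypothetical, and no request is sent here.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://alize2.finances.gouv.fr"

# Endpoint paths and form fields copied from the curl examples.
EXAMPLES = {
    "region": ("/regions/detail.php", {"REG": "025", "EXERCICE": "2014"}),
    "department": ("/departements/detail.php", {"DEP": "002", "EXERCICE": "2014"}),
    "epci": ("/communes/eneuro/detail_gfp.php", {
        "NOMDEP": "ALLIER", "ICOM": "008", "DEP": "003", "TYPE": "BPS",
        "PARAM": "0", "EXERCICE": "2014", "SIREN": "240300418",
    }),
    "commune": ("/communes/eneuro/detail.php", {
        "ICOM": "234", "DEP": "045", "TYPE": "BPS", "PARAM": "0", "EXERCICE": "2014",
    }),
}


def build_request(zone_type):
    """Return a POST request equivalent to the curl example for `zone_type`."""
    path, form = EXAMPLES[zone_type]
    data = urlencode(form).encode("ascii")
    # Supplying `data` makes urllib issue a POST, like `curl -X POST -d`.
    return Request(BASE_URL + path, data=data)


request = build_request("region")
print(request.get_method(), request.full_url)
```

To save a page, pass the request to `urllib.request.urlopen` and write the returned bytes to a file under test/data/.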

TODO

  • Add some docs, in particular the mapping between variable names and the fields in the HTML pages.