👀 keepin' an eye on data dot gov to ensure public data stays public

👁 eyeball 👁

Here thar be a simple scraper that compares the dataset totals on data.gov/metrics to those logged in the archive.csv file at the root of this repo and logs when they change.

Requirements

  • Python 3.x
  • lxml (pip install lxml)
  • requests (pip install requests)
  • csvkit (pip install csvkit)

Makin' data

Make a virtual environment and install the requirements (mkvirtualenv comes from virtualenvwrapper).

mkvirtualenv eyeball
pip install -r requirements.txt

Set VIRTUAL_ENV to the bin/ directory of your virtual environment.

export VIRTUAL_ENV=~/.virtualenvs/eyeball/bin  # Assumes virtualenv called eyeball

If you have GNU Make

Make the data.

make clean
make all

If you don't have Make

Update archive.csv and output/log.csv.

$VIRTUAL_ENV/python app.py

Then generate the summary file.

$VIRTUAL_ENV/csvsql --query " \
        SELECT \
          parent_agency, \
          subagency, \
          SUM(delta) AS net_difference \
        FROM log \
        GROUP BY subagency \
        ORDER BY SUM(delta) DESC" \
    output/log.csv > output/churn_summary.csv
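
If csvkit isn't handy, the same roll-up can be roughly sketched with the Python standard library. This is only an illustration, not the repo's own code: the function name churn_summary and the sample rows are made up, and it assumes delta holds integers. Like the query above, it groups by subagency alone and keeps the first parent_agency seen for each group.

```python
import csv
import io
from collections import defaultdict


def churn_summary(log_rows):
    """Sum delta per subagency and sort by net difference, largest first.

    Hypothetical helper: assumes each row has parent_agency, subagency
    and an integer delta, as in the query above.
    """
    totals = defaultdict(int)
    parents = {}
    for row in log_rows:
        name = row["subagency"]
        totals[name] += int(row["delta"])
        # Keep the first parent_agency seen for this subagency.
        parents.setdefault(name, row["parent_agency"])
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        {"parent_agency": parents[name], "subagency": name, "net_difference": total}
        for name, total in ordered
    ]


# Made-up sample standing in for output/log.csv.
sample = io.StringIO(
    "parent_agency,subagency,delta\n"
    "Agency A,Office X,-3\n"
    "Agency A,Office Y,5\n"
    "Agency A,Office X,1\n"
)
summary = churn_summary(csv.DictReader(sample))
for row in summary:
    print(row)
```

To run it on real data, swap the sample for an open file handle on output/log.csv.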

What am I looking at?

  • archive.csv contains the dataset counts from the last time the scraper was run
  • output/log.csv is a running log of each time those counts changed, with delta being the net difference between the observed total from the scraper and the total in archive.csv
  • output/churn_summary.csv is the net difference in dataset counts for each subagency between ~Jan. 24 and the last time the scraper was run
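
To make delta concrete, here is a minimal sketch of the comparison, assuming the counts have already been loaded into plain dicts keyed by subagency. The name diff_counts and the sample numbers are hypothetical, and the real app.py may handle new or vanished subagencies differently. Here a subagency missing from the archive is treated as having had 0 datasets, and one missing from the observed counts is ignored.

```python
def diff_counts(archived, observed):
    """Return one log row per subagency whose dataset count changed.

    delta = observed total - archived total, so a negative delta means
    fewer datasets are listed now than at the last run.
    """
    rows = []
    for name, new_total in observed.items():
        # Assumption: a subagency not yet in the archive starts from 0.
        old_total = archived.get(name, 0)
        delta = new_total - old_total
        if delta != 0:
            rows.append({"subagency": name, "delta": delta})
    return rows


# Made-up counts: Office X lost 3 datasets, Office Z is new with 2.
archived = {"Office X": 10, "Office Y": 4}
observed = {"Office X": 7, "Office Y": 4, "Office Z": 2}
changes = diff_counts(archived, observed)
print(changes)
```

Unchanged subagencies (Office Y here) produce no row, which matches the idea of a log that only grows when counts change.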

A note about dataset counts:

"... A collection is a group of similar datasets--for example, if there's a dataset that's created every year--so you could have multiple year's worth of data counting as one dataset. Sometimes when agencies organize and group similar datasets as a collection, the total number on catalog.data.gov can decrease significantly when the actual data available has not changed."

tl;dr - A "loss" of records does not always mean those records were deleted; they may have been reorganized.