# 👁 eyeball 👁
Here thar be a simple scraper that compares the dataset totals on data.gov/metrics to those logged in the `archive.csv` file at the root of this repo, and logs when they change.
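The core check is simple: fetch the current totals, then diff them against the archived ones. Here's a minimal sketch of that idea, assuming `archive.csv` has `subagency` and `datasets` columns and that the metrics page exposes counts in an HTML table — the XPath and column names are illustrative guesses, not the repo's actual code.

```python
import csv

import requests
from lxml import html

# Fetch the metrics page and pull per-agency dataset counts out of a table.
# The page structure (and this XPath) is a guess for illustration only.
page = requests.get("https://www.data.gov/metrics")
doc = html.fromstring(page.content)

scraped = {}
for row in doc.xpath("//table//tr"):
    cells = [cell.text_content().strip() for cell in row.xpath("./td")]
    if len(cells) >= 2 and cells[-1].replace(",", "").isdigit():
        scraped[cells[0]] = int(cells[-1].replace(",", ""))

# Diff against the archived counts and report any changes.
with open("archive.csv", newline="") as f:
    for archived in csv.DictReader(f):
        agency = archived["subagency"]     # assumed column name
        old = int(archived["datasets"])    # assumed column name
        new = scraped.get(agency)
        if new is not None and new != old:
            print(f"{agency}: {old} -> {new} (delta {new - old})")
```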
## Requirements

- Python 3.x
- lxml (`pip install lxml`)
- requests (`pip install requests`)
- csvkit (`pip install csvkit`)
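A `requirements.txt` covering that list (unpinned; pin versions if you want reproducible installs) would look like:

```
lxml
requests
csvkit
```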
## Makin' data
Make a virtual environment, and install the requirements.

```
mkvirtualenv eyeball
pip install -r requirements.txt
```
Set `VIRTUAL_ENV` to the location of your virtual env's `bin/`.

```
export VIRTUAL_ENV=~/.virtualenvs/eyeball/bin # Assumes virtualenv called eyeball
```
### If you have GNU Make

Make the data.

```
make clean
make all
```
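The Makefile isn't reproduced in this README; here's a minimal sketch of what `make all` and `make clean` plausibly do, given the manual steps below. The target names and the `clean` rule are assumptions (and note that Make recipes must be indented with tabs).

```make
VIRTUAL_ENV ?= ~/.virtualenvs/eyeball/bin

all: output/churn_summary.csv

# Run the scraper, which updates archive.csv and appends to output/log.csv.
output/log.csv: app.py
	$(VIRTUAL_ENV)/python app.py

# Summarize the log into per-subagency net differences.
output/churn_summary.csv: output/log.csv
	$(VIRTUAL_ENV)/csvsql --query "SELECT parent_agency, subagency, SUM(delta) AS net_difference FROM log GROUP BY parent_agency, subagency ORDER BY SUM(delta) DESC" output/log.csv > $@

clean:
	rm -f output/log.csv output/churn_summary.csv

.PHONY: all clean
```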
### If you don't have Make
Update `archive.csv` and `output/log.csv`.

```
$VIRTUAL_ENV/python app.py
```
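This README doesn't show `app.py`'s internals, but the logging step it describes might look roughly like this. Only `parent_agency`, `subagency`, and `delta` are confirmed by the query below; the function name, date column, and column order are guesses.

```python
import csv
from datetime import date

def log_change(parent_agency, subagency, old_count, new_count):
    """Append one row to output/log.csv; delta is observed minus archived."""
    with open("output/log.csv", "a", newline="") as f:
        csv.writer(f).writerow([
            date.today().isoformat(),   # assumed: the log is timestamped
            parent_agency,
            subagency,
            new_count - old_count,      # the delta column described below
        ])
```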
Then generate the summary file.
```
$VIRTUAL_ENV/csvsql --query " \
SELECT \
    parent_agency, \
    subagency, \
    SUM(delta) AS net_difference \
FROM log \
GROUP BY parent_agency, subagency \
ORDER BY SUM(delta) DESC" \
output/log.csv > output/churn_summary.csv
```
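If everything ran, `output/churn_summary.csv` should start with a header row like the one below, with rows sorted so the biggest net gains come first (an assumption about the output shape, not sample data):

```
parent_agency,subagency,net_difference
```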
## What am I looking at?
- `archive.csv` contains the dataset counts from the last time the scraper was run
- `output/log.csv` is a running log of each time those counts changed, with `delta` being the net difference between the observed total from the scraper and the total in `archive.csv`
- `output/churn_summary.csv` is the net difference in dataset counts for each subagency between ~Jan. 24 and the last time the scraper was run
A note about dataset counts:

> "... A collection is a group of similar datasets--for example, if there's a dataset that's created every year--so you could have multiple year's worth of data counting as one dataset. Sometimes when agencies organize and group similar datasets as a collection, the total number on catalog.data.gov can decrease significantly when the actual data available has not changed."
tl;dr - A "loss" of records does not always mean those records were deleted; they may have been reorganized.