We envision a future in which the public can easily understand how and why personally identifiable information gets collected by government agencies.
To get there, we're working with federal privacy offices to structure the data locked in PDF versions of privacy-related compliance documents. Structured data lets privacy offices search these documents more quickly, reduces unnecessary manual work, and lays a foundation for easier collaboration with engineering teams.
This project is funded by 10x.
Privacy Dashboard development repo here
Our phase three work is happening in partnership with the GSA's Privacy Office.
The scraping code is written in Python and runs locally. We recommend using virtualenv to create a virtual environment for installing and managing the required Python libraries. Run these commands in the repository directory on your machine to create a local virtual environment, activate it, and install all requirements.
virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt
Running python sorn_scraper.py does the following:
- Fetches the contents of the page where GSA publishes links and descriptions of System of Records Notices (SORNs)
- Scrapes the unique SORN identifier contained in each federalregister.gov URL and builds the URL for the XML version of the full-text document
- Downloads those XML files and parses them to get the text from specific sections of the document:
  - System Name
  - PII
  - Purpose
  - Retention Policy
  - Routine Uses
  - Document Title
- Outputs text from these fields into a local .csv file called gsa_sorns.csv, with one row per system.
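
The steps above can be sketched roughly as follows. This is a minimal illustration and not the code in sorn_scraper.py: the function names, the URL pattern for the XML version, the `HD` and `P` element names, the section headings, and the sample document are all assumptions made for the example. It uses only the standard library and writes to an in-memory buffer instead of gsa_sorns.csv.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Assumed URL shape: .../documents/YYYY/MM/DD/<document-number>/<slug>
DOC_URL_RE = re.compile(
    r"federalregister\.gov/documents/(\d{4})/(\d{2})/(\d{2})/([^/?#]+)"
)

# Hypothetical sample, NOT a real SORN; the element names are assumptions.
SAMPLE_XML = """<NOTICE><PRIACT>
<HD SOURCE="HD2">SYSTEM NAME:</HD>
<P>Example System</P>
<HD SOURCE="HD2">PURPOSE:</HD>
<P>To illustrate   parsing.</P>
<P>Second paragraph.</P>
</PRIACT></NOTICE>"""


def xml_url_from_html_url(html_url):
    """Build the assumed full-text XML URL from a document page URL."""
    match = DOC_URL_RE.search(html_url)
    if match is None:
        return None
    year, month, day, doc_number = match.groups()
    return (
        "https://www.federalregister.gov/documents/full_text/xml/"
        f"{year}/{month}/{day}/{doc_number}.xml"
    )


def section_text(root, heading):
    """Join the <P> text that follows the <HD> starting with `heading`."""
    wanted = heading.lower()
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag != "HD":
                continue
            label = "".join(child.itertext()).strip().lower()
            if not label.startswith(wanted):
                continue
            paragraphs = []
            for sibling in children[i + 1:]:
                if sibling.tag == "HD":  # next section starts here
                    break
                if sibling.tag == "P":
                    text = "".join(sibling.itertext())
                    paragraphs.append(" ".join(text.split()))
            return " ".join(paragraphs)
    return ""  # heading not found


def build_row(xml_text, headings):
    """Return one CSV row (a dict) for a single document."""
    root = ET.fromstring(xml_text)
    return {heading: section_text(root, heading) for heading in headings}


def write_rows(rows, fieldnames, out_file):
    """Write a header plus one row per system to an open text file."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    headings = ["System Name", "Purpose"]
    buffer = io.StringIO()
    write_rows([build_row(SAMPLE_XML, headings)], headings, buffer)
    print(buffer.getvalue())
```

To write a real file instead of a buffer, pass `open("gsa_sorns.csv", "w", newline="")` as `out_file`. Matching headings by prefix keeps the lookup tolerant of trailing colons, though real notices may label sections differently from the field names listed above.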