Predicting FedBizOpps doc compliance with Section 508

fbo-scraper (AKA Smartie)

FBO is the U.S. government's system of record for opportunities to do business with the government. Each night, the FBO system posts all updated opportunities as a pseudo-xml file that is made publicly available via the File Transfer Protocol (FTP), a standard network protocol for transferring files between a client and a server.

This project uses supervised machine learning to determine whether or not the solicitation documents of Information Communications Technology (ICT) notices contain appropriate Section 508 accessibility language.

Following a service-oriented architecture, this repository, along with a forthcoming API, provides a back-end to a UI that GSA policy experts will use to review ICT solicitations for 508 compliance; notify deficient solicitation owners; monitor changes in historical compliance; and validate predictions to improve model performance.

The application is designed to be run as a cron daemon within a Docker image on cloud.gov. This is tricky to achieve as traditional cron daemons need to run as root and have opinionated defaults for logging and error notifications. This usually makes them unsuitable for running in a containerized environment. So, instead of a system cron daemon, we're using supercronic to run the crontab.

Here's what happens every time the job is triggered:

  1. Download the pseudo-xml from the FBO FTP
  2. Convert that pseudo-xml to JSON
  3. Extract solicitations from the Information Communications Technology (ICT) categories
  4. Scrape each ICT solicitation's documents from their official FBO URLs
  5. Extract the text from each of those documents using textract
  6. Feed the text of each document into a binary classifier to predict whether or not the document is 508 compliant (the classifier was built and binarized using sklearn based on approximately 1,000 hand-labeled solicitations)
  7. Insert data into a PostgreSQL database
  8. Retrain the classifier if there is a sufficient number of human-validated predictions in the database (validation will occur via the UI)
  9. If the new model is an improvement, save it and carry on.
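
Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration, not the code in this repo. It assumes each notice opens with a bare record tag such as <PRESOL>, closes with the matching end tag, and carries one <FIELD>value pair per line in between. The sample text, the function names, and the set of ICT classification codes are all hypothetical.

```python
import json
import re

# Assumed layout of the pseudo-xml: a bare record tag, unclosed field tags,
# then a closing record tag. This sample is made up for illustration.
SAMPLE = """\
<PRESOL>
<DATE>0102
<SOLNBR>ABC-19-R-0001
<CLASSCOD>70
<SUBJECT>Laptop computers
</PRESOL>
<PRESOL>
<DATE>0102
<SOLNBR>XYZ-19-Q-0002
<CLASSCOD>Z
<SUBJECT>Roof repair
</PRESOL>
"""

# Hypothetical set of ICT classification codes, for illustration only.
ICT_CLASS_CODES = {"70", "D"}

FIELD = re.compile(r"^<([A-Z]+)>(.*)$")
CLOSE = re.compile(r"^</([A-Z]+)>$")


def parse_pseudo_xml(text):
    """Turn the pseudo-xml text into a list of dicts, one per notice."""
    notices, current, last_key = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if CLOSE.match(line):
            # A closing record tag ends the current notice.
            if current is not None:
                notices.append(current)
            current, last_key = None, None
            continue
        field = FIELD.match(line)
        if field and current is None:
            # A tag seen outside any record starts a new notice.
            current, last_key = {"type": field.group(1)}, None
        elif field:
            last_key = field.group(1).lower()
            current[last_key] = field.group(2).strip()
        elif current is not None and last_key is not None:
            # Untagged lines continue the previous field's value.
            current[last_key] = (current[last_key] + " " + line).strip()
    return notices


def filter_ict(notices, codes=ICT_CLASS_CODES):
    """Keep only the notices whose classification code is in `codes`."""
    return [n for n in notices if n.get("classcod") in codes]


ict_notices = filter_ict(parse_pseudo_xml(SAMPLE))
print(json.dumps(ict_notices, indent=2))
```

With the sample above, only the first notice survives the filter. Real feed files may contain fields or markup that this sketch does not handle.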

Getting Started

Prerequisites

This project uses:

  • Python 3.6.6
  • Docker
  • PostgreSQL 9.6.8

Below, we suggest venv for creating a virtual environment if you wish to run the scan locally.

To push to cloud.gov or interact with the app there, you'll need a cloud.gov account.

There are two Docker images for this project: fbo-scraper and fbo-scraper-test. The former contains the application that can be pushed to cloud.gov (see instructions below) while the latter is strictly for testing during CI.

Local Implementation

If you have PostgreSQL, you can run the scan locally. Doing so will create a database with the following connection string: postgresql+psycopg2://localhost/test. To run it locally (using FBO data from the day before yesterday), do the following:

cd path/to/this/locally/cloned/repo
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
# Now you can run the scan; logs are written to fbo.log
python fbo.py
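
The connection string mentioned above appears to follow the SQLAlchemy URL form (dialect+driver://host/database). As a quick way to see what each part means, here is a small sketch using only the standard library; the helper name describe_connection_string is hypothetical and is not part of this repo.

```python
from urllib.parse import urlsplit


def describe_connection_string(conn_str):
    """Split a dialect+driver://host/database URL into its main parts."""
    parts = urlsplit(conn_str)
    # The scheme holds "dialect+driver"; the driver part is optional.
    dialect, _, driver = parts.scheme.partition("+")
    return {
        "dialect": dialect,
        "driver": driver or None,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/") or None,
    }


info = describe_connection_string("postgresql+psycopg2://localhost/test")
print(info)
```

Here the dialect is postgresql, the driver is psycopg2, the host is localhost, and the database is named test. No port is given, so the driver's default applies.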

Running the tests

To run the tests, set up the environment as before, but run this instead:

python3 -W ignore -m unittest discover tests -p '*_test.py'

Several warnings and exceptions will print out. These are expected: the tests use mocks to trigger them deliberately.
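
To illustrate why a passing test run can still print exceptions, here is a minimal sketch of the pattern. It is not taken from this repo's test suite: safe_fetch, the logger name, and the URL are all hypothetical. The mock raises an error on purpose, the code under test logs it, and the test asserts that the failure was handled.

```python
import logging
import unittest
from unittest.mock import Mock

logger = logging.getLogger("example")


def safe_fetch(url, fetch):
    """Call fetch(url); log the error and return None if the connection fails."""
    try:
        return fetch(url)
    except ConnectionError:
        # logger.exception writes the message plus the traceback, which is
        # the output that shows up while the tests run.
        logger.exception("Failed to fetch %s", url)
        return None


class SafeFetchTest(unittest.TestCase):
    def test_returns_none_on_connection_error(self):
        # The mock stands in for a real network call and always fails.
        fetch = Mock(side_effect=ConnectionError("mocked outage"))
        result = safe_fetch("https://example.invalid/doc.pdf", fetch)
        self.assertIsNone(result)
        fetch.assert_called_once_with("https://example.invalid/doc.pdf")
```

Saved in a file whose name ends in _test.py inside the tests directory, a case like this would be picked up by the discover command above. It would pass while still printing the logged traceback.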

Deployment

Deployment requires a cloud.gov account and access to the application's org. If those prerequisites are met, you can log in with:

cf login -a api.fr.cloud.gov --sso

Then target the appropriate org and space by following the instructions.

Then push the app, creating the service first:

cf create-service <service> <service-tag>
cf create-service-key <service-tag>     # this may take a few minutes to configure
cf push srt-fbo-scraper --docker-image csmcallister/fbo-scraper
cf bind-service srt-fbo-scraper <service-tag>
cf restage srt-fbo-scraper

In the commands above, <service> is the name of your postgres service of choice (e.g. shared-psql) while <service-tag> is whatever you want to call it.

Logs

Logs are stored within the app in fbo.log. To access them, log in to cloud.gov with:

cf login -a api.fr.cloud.gov --sso

Then target your desired space. You can then SSH into the app, navigate to the log's directory, and view the contents:

cf ssh srt-fbo-scraper
cd ../code/
cat -n fbo.log

Contributing

Please read CONTRIBUTING for details on our code of conduct and the process for submitting pull requests to us.

License

This project is licensed under the Creative Commons Zero v1.0 Universal License. See the LICENSE file for details.

Acknowledgments