This project is the back end for the Crime Data Explorer 18f/crime-data-explorer. The Crime Data Explorer is a website that allows law enforcement and the general public to more easily access uniform crime data. The FBI collects this data from state and local law enforcement agencies and publishes it in the form of annual reports. This API allows users to look at multiple years of data as well as other information only provided as a master file from the FBI.
This project is a Python flask app that uses SqlAlchemy to wrap a Postgres database.
The full database of UCR crime data is too large to include in this repository or as a download -- over many hundreds of gigabytes in size -- comprising many tables that are linked together by primary and foreign key constraints. Instead, this project includes a small excerpt of the database that mainly comprises only data for Rhode Island in 2014. This slice means that API endpoints will return some data and that we can run our tests, but you will have to load the full database to see more data.
See the section of "Loading the Full Database" for information about loading the full database and the tasks we run after loading to add additional data and derive new tables.
Note: if you have a full database running somewhere on cloud.gov or the like, it's possible to setup a SSH tunnel to use that database instead. It's recommended you don't do this for development though.
First, let's set some environment variables that will be useful. You
might want to define these within a .env
file that can be used by
autoenv
export FLASK_APP=/path/to/autoapp.py
export FLASK_DEBUG=1
export CRIME_DATA_API_DB_URL="postgres:///crime_data_api_dev"
export APP_NAME=crime-data
You can run the following commands to bootstrap your environment and load the testing database into Postgres:
git clone https://github.com/18F/crime-data-api
cd crime-data-api
pip install -r requirements/dev.txt
createdb crime_data_api_dev
psql crime_data_api_dev < dba/crime_data_api_test.sql
flask run
You can then find the Swagger UI API documentation page at http://127.0.0.1:5000/. To test out the API, make a request to http://127.0.0.1:5000/agencies
The app is continuously deployed to cloud.gov
with CircleCI with every
commit to master
.
Tagged releases must be deployed manually using the
manifests/demo.yml
manifest file. You can use the following set of
steps once you have authenticated with cloud.gov:
git checkout <<version tag>>
- i.e.git checkout v2017-04-12
cf push -f manifests/demo.yml
In production environments, make sure the FLASK_DEBUG
environment
variable is unset or is set to 0
, so that ProdConfig
is used.
To open the interactive shell, run :
flask shell
By default, you will have access to the flask app
.
Tests are written in the py.test testing framework, with additional support from webtest for functional testing of the API. As part of its testing, it also will output coverage statistics for the tests.
To make it easier to see what the APIs can do, this project includes endpoints the return Swagger documentation for the API. These documents are not yet complete or consistent and can not yet be used to automatically generate API clients or to validate API responses. But they do let you test out the API and see roughly what the valid behavior is.
Unlike some other project that dynamically generate their Swagger from decorations applied to methods, we are using a static swagger.json file for this project. This means that developers will have to update the swagger file for any new methods or changes to the responses for existing methods. This is admittedly not immediately convenient, but it lets us treat the swagger.json file as a contract of what the API responses should be and use that for functional tests to ensure our responses conform to what is described.
We are using the flex Swagger validator in functional tests to verify they match Swagger. To add, modify your functional test to be like the following
from flex.core import validate_api_call
def TestCountsEndpoint:
def test_counts_matches_swagger(self, testapp, swagger):
res = testapp.get('/counts')
validate_api_call(swagger, raw_request=res.request, raw_response=res)
The swagger
fixture in py.test loads the
crime_data/static/swagger.json
into a Flex object so we can use it
for validating that requests and responses match what is in the
Swagger schema.
If you have loaded a full database from the FBI into production, there are still some other steps that you must run to use the data with the CDE. This is because the CDE does not use many of the tables within the UCR database directly, but instead has its own tables that are derived from and supplement the default database. There are several general types of such tables used by the CDE API:
- Additional data tables are necessary to load in some data that is not provided within the UCR database itself. For instance, estimated crime counts for the nation and state, crime figures for the territories and corrected cargo theft incidents are all examples of external tables that are also loaded into the database.
- Denormalized tables pull together data from multiple tables into a single record. The UCR database is in a highly normalized format that mitigates redundancy to reduce error, but it requires the database to join across dozens of tables to return the results of any API call, which is not performant.
- Aggregate tables collect monthly statistics in Return A and related reports from agencies and add them at the annual level. Similarly, we create annual counts from related monthly reports like arson as well as a few incident-level reports like hate crimes or cargo theft.
- NIBRS counts are a special type of aggregate tables that assemble ETL rollups of crime counts for specific dimensions within NIBRS incidents and related tables. So for instance, this table can tell you the annual counts by race_code or sex of all victims of aggravated assault for a given agency, or the general locations where crimes occur in a state. Because they are both specialized and could be large, there are many distinct tables for this information and different API endpoints for accessing them.
For a more detailed view of what tables are added to the stock UCR
database, see the related document CDE_TABLES.md. The important thing
here is how to load them. We have a script called after_load.sh
that
can be run to generate the necessary CDE tables. It's meant to be
idempotent and can be run multiple times safely. Some tables may take
a long time to build. Note that the script will output a fair amount
of debugging text under normal operation.
To run the after_load tasks, first you need to set an environment variable with a database URL with the location of the production database. I find the easiest way to do this is to setup a tunnel using the cf-service-connect plugin, which you must install to do this. You can then run it with the name of an running app service and the name of its database to get a list of connection values.
cf connect-to-service --no-client crime-data-api crime-data-upload-db
export CRIME_DATA_API_DB_URL=postgres://USERNAME:PASSWORD@localhost:PORT/DBNAME
and in another terminal running psql $CRIME_DATA_API_DB_URL
to
connect to the database through the tunnel. Make sure of course that
you aren't doing any of these steps to a production database serving
API requests to the public, since this operations will certainly cause
some rolling errors and downtime (best to work with a snapshot).
Anyhow, to augment the default UCR database with CDE tables, you need to run the following steps.
cd dba/after_load/ext
./load_external.sh
cd ../../
./after_load.sh
Note that while the load scripts are generally idempotent, the external data includes renaming the existing cargo theft tables to preserve them before loading revised data. This may spit out an error if you attempt to run again.
Also, please note that the after_load.sh runs a bunch of queries that might be quite slow. It might be better to run the commands inside individually if you have a lower-powered DB instance.
The master list of edited police names is located in
dba/after_load/ext/agency_name_edits.csv
. To update the names of
agencies, make changes to the edited_agency_name
column in the
CSV. Then do the following with an open tunnel to the production
database.
cd dba/after_load/ext
psql $CRIME_DATA_API_DB_URL < update_agency_names.sql
This will load in the agencies CSV from your machine than update a few tables where the agency name is displayed. Don't be alarmed if the same number of rows is not updated in every table, since some agencies may not be in various derived tables or may be present multiple times in others.
Don't forget to check in your changes to the agency_name_edits.csv and file a pull request when you are done.
This repository uses the bandit tool to run automated static analysis of the project code for potential vulnerabilities. These are run automatically as part of continuous integration to identify potential vulnerabilities when they are introduced in pull requests.
You can run bandit locally with the following command:
bandit -r .
In some cases, bandit will identify false positives, code that looks
like it could be a security vulnerability but that will likely never
be triggered in a production environment. To disable reporting of
these vulnerabilities, you can append a #nosec
comment on the line
of code where the vulnerability was identified.