/data-processing

All scripts/notebooks to clean, scrape, merge or otherwise process data files


Texas Justice Initiative - data processing

To learn more about TJI, visit our website at www.texasjusticeinitiative.org

The data itself lives in our data.world account

About this repo

Many different datasets and files are used by the TJI website and our analyses. All non-manual data processing steps live in this repo.

The scripts and notebooks that clean, scrape, merge, or otherwise process data files are organized into two main folders:

  • data_scraping/ - reads data from anywhere on the internet and writes csvs to TJI's data.world account
  • data_cleaning/ - data files should be both READ FROM and WRITTEN TO the TJI data.world account. Any dataset not on data.world should be scraped or manually added to data.world first.
    • The output of a cleaning script should be a file whose name begins with clean_.
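The naming convention above can be captured in a small helper. This is a hypothetical illustration, not a function in the repo; the name `clean_output_name` is an assumption:

```python
from pathlib import Path

def clean_output_name(raw_name: str) -> str:
    """Derive the conventional output filename for a cleaning script.

    Per the repo convention, cleaned files must begin with 'clean_'.
    This helper is illustrative only.
    """
    stem = Path(raw_name).stem
    if stem.startswith("clean_"):
        return f"{stem}.csv"  # already follows the convention
    return f"clean_{stem}.csv"
```

For example, `clean_output_name("cdr_reports_all.xlsx")` yields `"clean_cdr_reports_all.csv"`, and names that already carry the prefix are left alone.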

Overview for developers and data engineers

Regenerating data for the TJI website (repo) requires two steps:

1. Run the data cleaning scripts

  • Run these notebooks to generate the cleaned Officer Involved Shooting (OIS) datasets:
    • data_cleaning/clean_ois_civilians_shot.ipynb
    • data_cleaning/clean_ois_officers_shot.ipynb
  • Run this notebook to generate the cleaned Custodial Death Report data:
    • data_cleaning/clean_cdr.ipynb
  • Notes:
    • The raw data is manually maintained by Eva in Google Drive and automatically synced to data.world, but this data needs to be cleaned before it is ready for analysis or website use.
    • These notebooks both read from and write to data.world -- see later in this README for details.

2. Run create_datasets_for_website.ipynb

  • This will read the cleaned datasets and generate several output files on your local machine:
    • cdr_compressed.json
    • cdr_full.csv
    • shot_civilians_compressed.json
    • shot_civilians_full.csv
    • shot_officers_compressed.json
    • shot_officers_full.csv
  • Move these files into the data/ folder of the website repo and create a PR.
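The paired `*_full.csv` / `*_compressed.json` outputs suggest a step that strips each CSV down to the columns the website needs and writes them as compact JSON. A minimal sketch of that idea, assuming "compressed" means fewer columns and tight JSON separators (the function name and behavior are assumptions, not the notebook's actual code):

```python
import csv
import json
from pathlib import Path

def csv_to_compressed_json(csv_path: str, json_path: str, keep_columns: list[str]) -> None:
    """Write a records-style JSON containing only the listed columns,
    with separators tightened to minimize file size. Illustrative only."""
    with open(csv_path, newline="") as f:
        rows = [{k: row[k] for k in keep_columns} for row in csv.DictReader(f)]
    Path(json_path).write_text(json.dumps(rows, separators=(",", ":")))
```

Running this on a full CSV with a short column list would produce a smaller JSON file suitable for serving to the website front end.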

Automation

Data cleaning and compression for OIS and CDR data are currently automated via a daily cronjob. See the automation documentation for details.
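A daily cronjob of this kind is typically registered with a crontab entry of roughly the following shape. The schedule, interpreter, script path, and log path below are placeholders for illustration, not the repo's actual configuration:

```cron
# Run the cleaning + compression pipeline once a day at 04:00 (paths assumed)
0 4 * * * /usr/bin/python3 /opt/tji/data-processing/run_pipeline.py >> /var/log/tji_pipeline.log 2>&1
```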

Testing

To run tests in this repo, please follow the instructions guide.

TJI dataset details, means of creation, and data quirks to be aware of

[Updated: 2018-05-21]


Project: Texas Deaths in Custody from 2005-present - tji/tx-deaths-in-custody-2005-2015


  • File: cleaned_custodial_death_reports.csv
  • Description: All Texas custodial deaths since 2005 (a "custodial death" is a death in jail, prison, custody, or the process of arrest -- see Wikipedia)
  • Generation pipeline:
    1. (Manual) TJI staff manually parse and enter the data into a master spreadsheet, CDR Reports All.xlsx, in Google Drive, which is synced to data.world here
    2. A member of TJI runs this notebook to create the final file: data_cleaning/clean_cdr.ipynb
  • Quirks
    1. The Texas Department of Criminal Justice, which runs Texas prisons and a few state jails, until 2013 did NOT file custodial death reports for prisoners that died in an inpatient setting. In practice, this means that a good number of deaths from natural causes of state prisoners were not reported from 2005-2012 (you can see this clearly in the exploratory analysis here). Thus, if you simply plot custodial deaths over time, you'll see a jump from 2012 to 2013 for this reason.
    2. The form that was used to report custodial deaths changed in 2016, and by 2017 all records use the new form. The forms differ, but many questions are nearly the same. You can see the forms in this repo. The cleaning script attempts to match fields and options across form versions so the output file only has data that is consistent across all versions. See the form_version column in the output file to see what version was used for entering that record.
    3. Diligent collection of custodial deaths in Texas began in 2005, but inconsistent data exists as far back as 1980. To see these older files, explore the older_versions tab of the raw data file here.
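Quirks 1 and 2 above both matter for analysis: counting deaths per year will show the 2012-to-2013 reporting jump, and the form_version column reveals which reporting form produced each record. A small sketch of both checks, using hypothetical rows (the column names follow the README; the form_version values shown are assumptions, not the file's actual labels):

```python
from collections import Counter

# Hypothetical rows mimicking cleaned_custodial_death_reports.csv;
# only the columns used here are sketched.
records = [
    {"death_date": "2011-06-02", "form_version": "pre-2016"},
    {"death_date": "2013-03-14", "form_version": "pre-2016"},
    {"death_date": "2017-09-30", "form_version": "2016"},
]

# Deaths per year: on the real data this surfaces the 2012 -> 2013 jump,
# since TDCJ did not file reports for inpatient deaths before 2013.
deaths_per_year = Counter(r["death_date"][:4] for r in records)

# Records per form version: shows how much of the data came from each form.
by_form = Counter(r["form_version"] for r in records)
```

On the real file, a sharp rise in `deaths_per_year` at 2013 should be attributed to the reporting change, not an actual increase in deaths.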

Project: Officer Involved Shootings tji/officer-involved-shootings


  • File: shot_civilians.csv
  • Description: Civilians shot by police, late 2015 - present
  • Generation pipeline:
    1. A TJI bot monitors the Texas Attorney General's website for new OIS reports.
    2. New reports are emailed to TJI staff.
    3. TJI staff manually parse and enter the data into a master spreadsheet, OIS.xlsx, in Google Drive, which is synced to data.world here
    4. A member of TJI runs this notebook to create the final file: data_cleaning/clean_ois_civilians_shot.ipynb
  • Quirks
    1. There is one record for every shot civilian. Thus, if a single incident results in multiple civilians shot, there will be multiple rows with largely duplicate information (e.g. address, date, officer details, etc). Incident-level analysis should de-duplicate, say by matching on date and address.
    2. It's hard to know exactly how many officers were on scene. In theory, there are two pieces of information in each record that reveal this information. First, there is a checkbox on the form called "multiple officers involved," which is checked about 80% of the time. Second, there are spaces in the form for the details (agency, gender, race, age, etc) of each officer involved. However, when "multiple officers involved" is checked, only ~half the time do details for more than one officer exist. Similarly, sometimes "multiple officers involved" is NOT checked, yet details for multiple officers exist. It's unclear what to make of this information.
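One way to reconcile the two contradictory officer-count signals in quirk 2 is to treat them as a lower bound. This heuristic is our suggestion, not TJI's method; the field names `multiple_officers_involved` and `officer_details` are assumptions about the record structure:

```python
def min_officers_on_scene(record: dict) -> int:
    """Lower bound on officers present, reconciling the checkbox with the
    per-officer detail entries. Field names are hypothetical."""
    n_detailed = len(record.get("officer_details", []))
    if record.get("multiple_officers_involved"):
        # The checkbox implies at least two, even if details exist for only one.
        return max(2, n_detailed)
    # Without the checkbox, trust whatever details exist (at least one officer).
    return max(1, n_detailed)
```

For instance, a record with the checkbox set but details for a single officer would still count as at least two officers on scene.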

  • File: shot_officers.csv
  • Description: Peace officers shot in the line of duty, late 2015 - present
  • Generation pipeline:
    1. Identical to shot_civilians.csv above, except that in the last step, a different notebook is run: data_cleaning/clean_ois_officers_shot.ipynb
  • Quirks
    1. Analogous to the previous file, there is one record for every shot officer. Thus, if a single incident results in multiple officers shot, there will be multiple rows with largely duplicate information (e.g. address, date, civilian details, etc). Incident-level analysis should de-duplicate, say by matching on date and address.
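The de-duplication suggested for both OIS files can be sketched as follows. This is a minimal illustration using stdlib only; the `date` and `address` keys match the matching criteria the quirks describe, and keeping the first row per incident is an arbitrary choice:

```python
def incident_level(rows: list[dict]) -> list[dict]:
    """Collapse person-level rows to one row per incident, keying on
    (date, address) as suggested above. Keeps the first row seen."""
    seen = {}
    for row in rows:
        key = (row["date"], row["address"])
        seen.setdefault(key, row)  # first row wins for each incident
    return list(seen.values())
```

Note that two distinct incidents at the same address on the same date would be merged by this key, so a stricter key (e.g. adding agency) may be warranted for careful analysis.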

Project: Auxiliary Datasets tji/auxiliary-datasets