This repository contains all code used in “The age of secrecy and unfairness in recidivism prediction” by Cynthia Rudin, Caroline Wang, and Beau Coker.
There are two data sources that are used in our analysis but not stored in this repository:
- The database of ProPublica data, which should be stored as Raw_data/compas-analysis/compas.db.
- The probation data that we purchased. It can be downloaded here and should be stored in Raw_data/probation/.
These data sources are needed to run Table_construction.rmd. Note that the features created by this script are stored in Table_construction.rdata and Table_construction_arrests_hist_violence.rdata.
There are two Rdata files in the repository for the following reason:
- The data processing that produces “Table_construction.Rdata” primarily uses charges.
- The data processing that produces “Table_construction_arrests_hist_violence.Rdata” primarily uses arrests.
The difference in the code can be found in the functions.R script. Note that we only switch between charges/arrests when computing the History of Violence subscale features. The History of Violence subscale is the main place where the distinction between charges and arrests becomes relevant.
To run with charges: comment out lines 69-139 in functions.R and make sure the output is saved as “Table_construction.Rdata”. To run with arrests: keep lines 69-139 in functions.R and make sure the output is saved as “Table_construction_arrests_hist_violence.Rdata”.
The other scripts should read in either “Table_construction.Rdata” OR “Table_construction_arrests_hist_violence.Rdata” (one only!), depending on whether you want the data processing to use charges or arrests.
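As a minimal sketch of the "one only" rule, a script could pick the feature file from a single flag. The helper name `feature_file` and the `use_arrests` flag are hypothetical and do not appear in the repository; only the two file names come from the text above.

```r
# Hypothetical helper (not part of the repository): return exactly one
# of the two feature files described above.
feature_file <- function(use_arrests = FALSE) {
  if (use_arrests) {
    "Table_construction_arrests_hist_violence.Rdata"
  } else {
    "Table_construction.Rdata"
  }
}

path <- feature_file(use_arrests = FALSE)
if (file.exists(path)) {
  load(path)  # restores the objects that were saved into the file
} else {
  message("Feature file not found: ", path)
}
```

Keeping the choice in one place makes it harder to load both files by accident in the same session.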
- db2csv.r converts each table in compas.db to a csv file. Run this before running Table_construction.rmd.
- Table_construction.rmd processes the raw data into the features used for our analysis. The output is stored in Table_construction.rdata.
- predict_scores.rmd and predict_scores_violent.rmd predict general and violent COMPAS raw score remainders, respectively. Also included are the fitted age polynomial and the logistic regression that includes age, sex, and recidivism as features.
- predict_recidivism.rmd and predict_recidivism_violent.rmd predict general and violent two-year recidivism, respectively.
- age_only.rmd contains a few plots that investigate the impact of age at COMPAS screening date.
- functions.r includes a number of custom helper functions.
- app.r is a Shiny app (see below).
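For orientation, the kind of conversion db2csv.r performs (one csv file per database table) could look roughly like the sketch below. This is not the repository's script: the function name `export_tables_to_csv` and the output directory are assumptions, and the sketch requires the DBI and RSQLite packages to be installed.

```r
# Sketch only, assuming the DBI and RSQLite packages are available.
# export_tables_to_csv is a hypothetical name, not the code in db2csv.r.
export_tables_to_csv <- function(db_path, out_dir) {
  con <- DBI::dbConnect(RSQLite::SQLite(), db_path)
  on.exit(DBI::dbDisconnect(con), add = TRUE)  # always close the connection
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (tbl in DBI::dbListTables(con)) {
    df <- DBI::dbReadTable(con, tbl)
    out_file <- file.path(out_dir, paste0(tbl, ".csv"))
    utils::write.csv(df, out_file, row.names = FALSE)
  }
  invisible(NULL)
}

# Example call; the output directory is an assumption:
# export_tables_to_csv("Raw_data/compas-analysis/compas.db",
#                      "Raw_data/compas-analysis")
```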
There are a few different groups of features used in predicting the COMPAS raw score remainders and two-year recidivism:
- Group 1: Does not include age at screening date or race.
- Group 2: Includes race but not age at screening date.
- Group 3: Includes age at screening date but not race.
- Group 4: Includes age at screening date and race.
- Group 5: Includes age at screening date and race as well as total number of charges and total number of arrests. Used only for COMPAS raw score remainder prediction.
These groups are referenced in the code.
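The group definitions can be summarized as a small lookup table. The sketch below is illustrative only: the column names (`age_at_screening`, `race`, `total_charges`, `total_arrests`) and the helper `select_group` are hypothetical and may not match the names used in the scripts.

```r
# Which optional features each group includes, as listed above.
feature_groups <- data.frame(
  group  = 1:5,
  age    = c(FALSE, FALSE, TRUE, TRUE, TRUE),
  race   = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  totals = c(FALSE, FALSE, FALSE, FALSE, TRUE)  # total charges and arrests
)

# Hypothetical helper: drop the columns that a group excludes.
select_group <- function(df, group,
                         age_col = "age_at_screening",
                         race_col = "race",
                         total_cols = c("total_charges", "total_arrests")) {
  g <- feature_groups[feature_groups$group == group, ]
  excluded <- c(
    if (!g$age) age_col,
    if (!g$race) race_col,
    if (!g$totals) total_cols
  )
  df[, setdiff(names(df), excluded), drop = FALSE]
}
```

For example, `select_group(features, 3)` on a data frame named `features` (also hypothetical) would keep age at screening date but drop race and the two totals.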
Depending on the analysis, some observations are discarded. For example, some individuals were given a COMPAS score less than two years before the data were pulled, so we cannot assess whether they committed a new crime within two years. At most there are 12,381 unique person / COMPAS screening date combinations. The following filters are used at some point in the analysis. Each count in parentheses is the number of observations the filter removes from the full set of observations, for the general and violent scores respectively, so the observations removed by different filters may overlap:
- Filter 1: Remove observations with -1 decile scores (removes 15 general, 15 violent).
- Filter 2: Remove all but African-Americans and Caucasians (removes 1929 general, 1929 violent).
- Filter 3: Remove observations with no current offense (removes 3331 general, 3331 violent).
- Filter 4: Remove observations without two years of data past screening date (removes 6615 general, 6615 violent).
- Filter 5: Remove observations with current age <= 18 or > 65 (removes 739 general, 739 violent).
- Filter 6: Remove observations with nonzero subscale inputs, for the relevant subscales (removes 8239 general, 5658 violent).
- Filter 7: Remove all observations 0.05 below the age polynomial (removes 18 general, 11 violent).
- Filter 8: Remove all observations 0.05 below the age spline (removes 22 general, 12 violent).
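Because each removal count is taken against the full set of observations, the counts for different filters can cover the same rows. The toy example below illustrates this for Filters 1, 2, and 5. The data and column names (`decile_score`, `race`, `age`) are made up for illustration and are not the repository's.

```r
# Toy data; values and column names are hypothetical.
toy <- data.frame(
  decile_score = c(5, -1, 3, 8, 2, -1),
  race = c("African-American", "Caucasian", "Hispanic",
           "Caucasian", "Other", "Asian"),
  age = c(25, 17, 40, 70, 18, 30),
  stringsAsFactors = FALSE
)

# Each filter is a logical mask marking the rows it would REMOVE.
removes <- list(
  filter1 = toy$decile_score == -1,
  filter2 = !(toy$race %in% c("African-American", "Caucasian")),
  filter5 = toy$age <= 18 | toy$age > 65
)

# Count per filter against the full data, as in the list above.
removed_per_filter <- sapply(removes, sum)

# Rows removed by at least one filter; overlap makes this smaller
# than the sum of the per-filter counts.
removed_by_any <- sum(Reduce(`|`, removes))
kept <- toy[!Reduce(`|`, removes), , drop = FALSE]
```

In this toy data the per-filter counts add up to 8, but only 5 distinct rows are removed, which is why the counts in the list cannot simply be summed.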
The following results in the paper use the listed filters:
- f(age) and f(viol_age) fitting + plot: Filters 1, 5, 6, 7
- Logistic regression AND probability of reoffending plot: Filters 1, 3, 4
- Any COMPAS score prediction (whether or not reverse-engineered components are subtracted): Filters 1, 3
- Any recidivism prediction: Filters 1, 3, 4
- raw_score - f(age) vs. number of priors AND vs. criminal involvement plot: Filters 1, 3, 8 (but those filtered by 8 added to plot in green)
- TPR/FPR plot: Filters 1, 2, 3
- raw_score - f(age) vs. history of violence fitting + plot: Filters 1, 3, 8 (but those filtered by 8 added to plot in green)
- raw_score - f(age) - g(vio_hist) vs. history of noncompliance plot: Filters 1, 3, 8 (but those filtered by 8 added to plot in green)
- Age histograms AND mean/median age statistics: Filters 1, 2
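One way to keep this mapping in a single place is a named list of filter numbers per result, combined with one removal mask per filter. The sketch below covers only some of the results listed above, and the names `filters_for` and `apply_filters` are hypothetical and not taken from the repository.

```r
# Filter numbers per result, copied from the list above (subset only).
filters_for <- list(
  age_fit = c(1, 5, 6, 7),
  logistic_regression = c(1, 3, 4),
  compas_score_prediction = c(1, 3),
  recidivism_prediction = c(1, 3, 4),
  tpr_fpr_plot = c(1, 2, 3),
  age_histograms = c(1, 2)
)

# Hypothetical helper: `masks` is a list named "1", "2", ... whose
# elements are logical vectors marking the rows each filter removes.
apply_filters <- function(df, masks, ids) {
  ids <- as.character(ids)
  stopifnot(all(ids %in% names(masks)))
  remove <- Reduce(`|`, masks[ids], rep(FALSE, nrow(df)))
  df[!remove, , drop = FALSE]
}
```

A call such as `apply_filters(df, masks, filters_for$tpr_fpr_plot)` would then drop every row flagged by Filter 1, 2, or 3.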
We created a Shiny application to display criminal history data, COMPAS scores, and the features we constructed for each individual in the dataset. To choose an individual, type a person identification number and select a COMPAS screening date.
The Charge, Arrest, Jail, Prison, and COMPAS tabs show data made available by ProPublica. Information is separated by when it occurred in relation to the COMPAS screening date (before the date, on the date, and after the date). The Features tab shows our constructed features. There is also some useful information in the Profile tab.