invinst/chicago-police-data

How to relate Feb, April, and May data sets?

Closed this issue · 1 comment

DGalt commented

A question I keep running into is what the most appropriate way is, if any, to combine or relate the data sets available to us. To briefly summarize what we have (this is in the wiki as well):

  • February data set: from CPD, composed of firearm discharge data
  • April data set: from IPRA, composed of police misconduct data
  • May data set: from IPRA, composed of shooting / tasing (+a few misc others) data

The unique-identifier column in the February dataset is Log No, while in the April and May datasets it is Complaint_Number.

Assuming that we can treat the values in Log No in the February dataset as equivalent to the values found in the Complaint_Number column found in the April and May datasets (@chaclynhunt, @rajivsinclair can you confirm / refute this):

  • There are 405, 7175, and 361 unique IDs (whether Log No or Complaint_Number) in Feb, April, and May, respectively
  • For the Feb set: 236, 322, and 211 of the IDs do not exist in the April, May, and combined April+May datasets, respectively
  • For the April set: 7006, 7014, and 6903 of the IDs do not exist in the Feb, May, and combined Feb+May datasets, respectively
  • For the May set: 278, 200, and 175 of the IDs do not exist in the Feb, April, and combined Feb+April datasets, respectively
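
For anyone who wants to reproduce counts like these, here is a minimal sketch of the set arithmetic involved. It assumes the IDs from each file have already been loaded into Python sets of strings; the function name `missing_counts` and the toy IDs are made up for illustration and are not from the actual data.

```python
def missing_counts(ids_by_dataset):
    """For each dataset, count IDs that are absent from each other
    dataset and from the union of all the other datasets."""
    report = {}
    for name, ids in ids_by_dataset.items():
        others = {k: v for k, v in ids_by_dataset.items() if k != name}
        # IDs in this dataset that are missing from each other dataset
        row = {f"not_in_{k}": len(ids - v) for k, v in others.items()}
        # IDs in this dataset that appear in none of the other datasets
        combined = set().union(*others.values())
        row["not_in_any_other"] = len(ids - combined)
        report[name] = row
    return report


# Toy IDs for illustration only (not real Log No / Complaint_Number values)
toy = {
    "feb": {"1001", "1002", "1003"},
    "april": {"1002", "1004"},
    "may": {"1003", "1005"},
}
report = missing_counts(toy)
for name, row in report.items():
    print(name, row)
```

As a sanity check on the real numbers above, the pairwise overlaps they imply agree from both directions: 405 - 236 = 7175 - 7006 = 169 shared IDs for Feb/April, 405 - 322 = 361 - 278 = 83 for Feb/May, and 7175 - 7014 = 361 - 200 = 161 for April/May.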

Overall, the columns in the April and May sets largely correspond to each other, although May has several extra columns that do not exist in April (some work on matching the April and May columns can be found here, about halfway down the page).

February, in contrast, is a bit sparser in terms of overall data. While I think most of the columns in February can be matched to columns in April/May, there are a number of columns in April/May that do not exist in February.

One thing that might be worth looking into is how much of the data agrees for the unique identifiers that appear in more than one dataset.
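
A rough sketch of how that comparison could work, assuming each dataset has been read into a dict keyed by ID and that we have a hand-built mapping between matching columns. The column names used here (`Date`, `Incident_Date`, `Beat`, `Beat_No`), the helper `compare_shared`, and the records are all hypothetical placeholders, not the real headers or data.

```python
def compare_shared(records_a, records_b, column_map):
    """Compare mapped columns for IDs present in both datasets.

    records_a / records_b: dict of ID -> dict of column -> value
    column_map: column name in dataset A -> matching column name in B
    Returns (number of shared IDs, per-column agreement counts, conflicts).
    """
    shared = records_a.keys() & records_b.keys()
    agree = {col: 0 for col in column_map}
    conflicts = []
    for rid in sorted(shared):
        for col_a, col_b in column_map.items():
            va = records_a[rid].get(col_a)
            vb = records_b[rid].get(col_b)
            if va == vb:
                agree[col_a] += 1
            else:
                # Keep both values so disagreements can be reviewed by hand
                conflicts.append((rid, col_a, va, vb))
    return len(shared), agree, conflicts


# Hypothetical records and column names, for illustration only
feb = {
    "1001": {"Date": "2012-01-05", "Beat": "0312"},
    "1003": {"Date": "2012-02-10", "Beat": "1121"},
}
may = {
    "1003": {"Incident_Date": "2012-02-10", "Beat_No": "1122"},
    "1005": {"Incident_Date": "2012-03-01", "Beat_No": "0724"},
}
column_map = {"Date": "Incident_Date", "Beat": "Beat_No"}

n_shared, agree, conflicts = compare_shared(feb, may, column_map)
print(n_shared, agree, conflicts)
```

Real values would probably need normalizing first (date formats, whitespace, capitalization), otherwise formatting differences will show up as conflicts.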

Considering, though, that these three datasets are produced by different sources (particularly reports of misconduct vs. the reports generated when an officer uses his/her firearm), I don't know that collapsing them into one large dataset is the best path forward. Or maybe it is, hence the need for discussion :)

Great point @DGalt. It seems to me there are about 3 basic ways to extract value from these:

  • Analyze each data set separately: Take each data set as an isolated slice and see what we can learn from each one.
  • Analyze overlap & conflicts between data sets: If the data sets contradict each other, that could be interesting to dig into.
  • Import all of them into the Chicago Police Data Project profiles: ... to build up a more complete profile for each officer. I believe this is what @rajivsinclair and co are working on at the moment.

All of these strike me as valid approaches to seeking out information & insights from the data.