Try to merge shootings-append.csv into shootings-merge.csv
Closed this issue · 17 comments
`shootings-append.csv` appends data from two data sets on police shootings released pursuant to FOIA requests submitted in April and May 2016.
This data is dynamic, not static. For example, the status or details about a police shooting may change as an investigation progresses. There may also be other inconsistencies between the data sets.
We need to merge the `shootings-append.csv` data into a combined file, `shootings-merge.csv`. You can use your GitHub username or initials to identify your attempt at merging the data. For example, Matt Li (a fictional person) might name his solution `shootings-merge-ml.csv`.
Feel free to use your own tools to merge the data. One contributor is using Stata, but you can also use Python or R. We hope that through multiple parallel attempts at merging the data, we can catch each other's mistakes.
To quote @rajivsinclair: "We are all about redundancy. We can't have enough of that. Do it your own way, look at our start, look at our finish, and compare the results. Help us identify the flaws in our approach."
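One way to compare two contributors' finished merges is an outer join keyed on the complaint ID. A minimal sketch with toy data, assuming a shared key column (I'm calling it `CRID` here, which may not match the real schema) and a shared `STATUS` column:

```python
import pandas as pd

# Toy stand-ins for two contributors' independent merge attempts.
merge_ml = pd.DataFrame({"CRID": [1001, 1002, 1003],
                         "STATUS": ["Open", "Closed", "Open"]})
merge_xy = pd.DataFrame({"CRID": [1001, 1002, 1004],
                         "STATUS": ["Open", "Open", "Closed"]})

# Outer-join on the complaint ID; the indicator column records which
# attempt each row came from.
diff = merge_ml.merge(merge_xy, on="CRID", how="outer",
                      suffixes=("_ml", "_xy"), indicator=True)

# Rows present in only one attempt, or with conflicting values,
# are candidates for manual review.
only_one = diff[diff["_merge"] != "both"]
conflicts = diff[(diff["_merge"] == "both") &
                 (diff["STATUS_ml"] != diff["STATUS_xy"])]
print(only_one[["CRID", "_merge"]])
print(conflicts[["CRID", "STATUS_ml", "STATUS_xy"]])
```

Running something like this over each pair of submitted `shootings-merge-*.csv` files would surface exactly where our approaches disagree.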
We are going to work on documenting `shootings-append.csv` so the meaning of each column is clear. Please share questions about the data in this thread.
- To clarify: what exactly is meant here? Do we have two of (ostensibly) the same dataset with slightly conflicting information? Is there a reason why they are appended in one file rather than in two separate files?
- Is `csv` the best format for storing this data? Since there can be multiple people involved for the same complaint number, would a hierarchical format like `json` be better?
Hi @jayqi! Thanks for the excellent questions.
- These are NOT the exact same dataset. Many of the columns should (in theory) be the same from month to month. But because these files describe ongoing investigations, we expect some of the columns to change over time, such as CURRENT_CATEGORY and CURRENT_STATUS. Take a look at COLUMN-DICTIONARY.md, where we are currently working on documenting what each of the columns means.
My understanding of project goals (copy-pasting from the README): "Our first goal is to use the FOIA’ed shootings data here to identify officers per incident log number in the June 3 trove."
But you raise interesting questions: is it crucial to merge the FOIA dumps before we try to link officers with incident log numbers? If so, why?
Maybe we could skip the merging step if it lets us deliver value quicker. Curious to hear what you think @rajivsinclair @ithinkidunno.
If the May set is more recent, why not just use the May set? What information is contained in the April set that the May set is missing?
@jayqi Yes, great question. Let's dive into the data to find out. My understanding from @rajivsinclair and @ithinkidunno is that there may be some missing information in the May set that could be complemented by merging with the April set. But I would agree that it couldn't hurt to start with the May set and see what we find.
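If we do end up preferring the May rows but backfilling gaps from April, pandas' `combine_first` expresses that directly. A sketch with toy data; using `CRID` as the index key is an assumption, not the documented schema:

```python
import pandas as pd

# Toy stand-ins for the April and May FOIA sets, keyed by complaint ID.
april = pd.DataFrame({"CRID": [1, 2, 3],
                      "CURRENT_STATUS": ["Open", "Open", "Closed"]}).set_index("CRID")
may = pd.DataFrame({"CRID": [2, 3],
                    "CURRENT_STATUS": ["Closed", None]}).set_index("CRID")

# Prefer May's values; fall back to April wherever May is missing a
# whole row (CRID 1) or an individual cell (CRID 3's status).
combined = may.combine_first(april)
print(combined)
```

This "May first, April as backup" rule is only one possible policy; if a status can legitimately revert, we'd want something more careful.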
Same with `json`. Your `json` suggestion sounds very sensible to me. I am curious what @rajivsinclair and @ithinkidunno think.
`json` would make a lot of sense for columns that we expect to change over time, too:
```
// Incident
{
  "involved officers": [ ... ],
  "accused officer": { ... },
  "categories": [
    {
      "date": ...,
      "category_code": ...,
      "category_text": ...
    },
    {
      "date": ...,
      "category_code": ...,
      "category_text": ...
    }
  ]
}
```
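Folding the flat CSV rows into that nested shape is a `groupby` away. A sketch with made-up rows; the column names (`CRID`, `date`, `category_code`, `category_text`) are illustrative, not the confirmed schema:

```python
import pandas as pd

# Toy flat rows: one row per (incident, category-change) pair.
rows = pd.DataFrame({
    "CRID": [1055981, 1055981],
    "date": ["2016-04-01", "2016-05-01"],
    "category_code": ["18A", "20A"],
    "category_text": ["Firearm discharge", "Taser discharge"],
})

# Fold each incident's category history into a list of dicts,
# matching the nested JSON layout above.
incidents = [
    {"CRID": crid,
     "categories": grp[["date", "category_code", "category_text"]].to_dict("records")}
    for crid, grp in rows.groupby("CRID")
]
print(incidents[0]["categories"])
```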
The May and April files are different. This is why I appended them rather than merging them: the April one has a bigger universe of complaints, whereas the May one only had the shootings.
OK, this is useful @ithinkidunno.
Strikes me that a couple of simple counts would be useful:
- Number of unique incident IDs in the June 3 dump
- ditto for the May data
- ditto for the April data
- ditto for the Feb data

... and how much overlap exists between the incident IDs in each pair of data sets.
That could let us answer questions like:
- "Can the April data set alone let us link most of the June 3 incidents to CPD officers?"
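Once the incident-ID columns are loaded (e.g. something like `set(df["CRID"].dropna())` for each file, if `CRID` turns out to be the key), those counts reduce to set arithmetic. Toy IDs below:

```python
# Toy incident-ID sets standing in for the four FOIA dumps.
june = {"C1", "C2", "C3", "C4"}
may = {"C2", "C3"}
april = {"C1", "C2", "C3"}
feb = {"C3"}

# For each earlier dump: how many unique IDs it has, and how many of
# the June 3 incidents it could help us link.
for name, ids in [("May", may), ("April", april), ("Feb", feb)]:
    overlap = len(june & ids)
    print(f"{name}: {len(ids)} unique IDs, {overlap}/{len(june)} overlap with June 3")
```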
These don't seem to have definitions, which is odd since these columns should have the most relevance for the merging. Don't we need more context (from COLUMN-DICTIONARY.md)?
edit: this is issue #9
@freddymartinez9 Yup, we definitely need to provide more context here. The column documentation should be complete (at least a first draft) by tonight.
@jayqi Are you coming to ChiHackNight tonight? Loved the ideas you shared here, would be excellent to meet up and work together on this in person. Let me know. If you want to chat, my email is in my profile.
I have another version of this that I can submit a PR for if you want, although we might want to wait until we have a consensus on how we're treating May. You can take a look at the columns for april and may that I combined / renamed here, though, in the section marked "Cleaning / Combining April and May":
https://github.com/DGalt/shootings-data/blob/dev/april_may_munging.ipynb
@DGalt Another version of `shootings-append.csv`? I think a PR for that would be very welcome. Did you get a chance to compare it with the original `shootings-append.csv`? Any major differences?
I noticed this line in the iPython notebook:

```python
sht_apd = pd.read_csv('./Raw/shootings-append.csv', low_memory=False)
```

... which seems to be reading @ithinkidunno's append for a comparison attempt, but I didn't see the `sht_apd` variable later on. I may also be misreading this!
@DGalt The Exploring Data section is extremely interesting.
Couple of questions:
- I wonder how the results line up against https://cpdb.co/data/bP8dwb/citizens-police-data-project?
- In the last table, what should we make of `nan`?
- The last table seems to be complaints with the most involved officers, is that correct? Creative way to slice the data, I would never have thought of that. I'm curious about those incidents: 1055981, 1070415, 1055068...
You're right, I load @ithinkidunno's shootings-append but don't use it again. This is a cleaned-up version of an earlier notebook where I was comparing the two, but I didn't include the comparison here. One major difference is that @ithinkidunno's file only includes the subset of the April data set that corresponds to shootings and tasings (the events where the incident code begins with either "18", "20", or "S"). I do not do that here, as it's not yet clear whether we're focusing on those incidents or all possible incidents. It would be easy enough, though, to filter those particular matches out of my file and compare the two different versions.
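Filtering down to that shooting/taser subset should be a one-liner, assuming the code lives in a single column (the name `incident_code` below is a placeholder):

```python
import pandas as pd

# Toy combined file; the real one would be the merged April/May data.
df = pd.DataFrame({"incident_code": ["18A", "20B", "S-123", "05C", "45A"]})

# Keep only codes beginning with "18", "20", or "S"
# (shootings and tasings, per @ithinkidunno's subset).
mask = df["incident_code"].str.startswith(("18", "20", "S"))
shootings = df[mask]
print(shootings)
```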
As for your other questions:
- I haven't looked at the cpdp dataset, but it would definitely be interesting to compare the two.
- The `nan` values correspond to a `nan` in the `accused_officer_fname` or `accused_officer_lname` column. I should probably re-do this so that if there is `nan` in one of those columns but not the other, that's actually represented in the data.
- Yes, that is correct. I can provide the full dataset (for both this and for sorting by the number of incidents each officer is involved in). If particular incidents look interesting, it's easy enough to filter them out and look at them specifically from the combined dataset.
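One way to make the partial-name case explicit: keep the raw name columns and derive a status flag instead of collapsing everything to a single `nan`. The column names are the ones from the notebook; the `name_status` flag is made up for this sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "accused_officer_fname": ["John", None, "Mary", None],
    "accused_officer_lname": ["Doe", "Smith", None, None],
})

fname_missing = df["accused_officer_fname"].isna()
lname_missing = df["accused_officer_lname"].isna()

# Hypothetical flag distinguishing fully-known, partially-known,
# and fully-unknown officer names.
df["name_status"] = "complete"
df.loc[fname_missing ^ lname_missing, "name_status"] = "partial"
df.loc[fname_missing & lname_missing, "name_status"] = "missing"
print(df["name_status"].tolist())
```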
Merging in another related issue:

Right now, `shootings-append.csv` includes only data from the April and May files. The February file includes only three types of incident categories: 18A, 18B, 20A. It uses a very different column schema from the April and May files, and it has less information than the other two files.

We would like to merge it into `shootings-append.csv`, but we need to figure out how to accurately map the column schemas onto each other, given that February has less information.
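Once the column correspondence is worked out, the mechanical part is a rename plus a concat; columns February lacks just come through as missing. The February column names below are placeholders, not the real schema:

```python
import pandas as pd

# Placeholder mapping from hypothetical February column names to the
# April/May schema being documented in COLUMN-DICTIONARY.md.
feb_to_append = {
    "COMPLAINT_NO": "CRID",
    "CATEGORY": "CURRENT_CATEGORY",
}

feb = pd.DataFrame({"COMPLAINT_NO": [9001], "CATEGORY": ["18A"]})
append = pd.DataFrame({"CRID": [9002], "CURRENT_CATEGORY": ["20A"],
                       "CURRENT_STATUS": ["Open"]})

# Rename February's columns, then stack the two frames; any column
# February doesn't have becomes NaN in its rows.
merged = pd.concat([append, feb.rename(columns=feb_to_append)],
                   ignore_index=True, sort=False)
print(merged)
```

The hard part remains deciding, column by column, whether a February field really means the same thing as its April/May counterpart.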