Try to merge shootings-append.csv into shootings-merge.csv
Closed this issue · 17 comments
`shootings-append.csv` appends data from two data sets on police shootings released pursuant to FOIA requests submitted in April and May 2016.
This data is dynamic, not static. For example, the status or details about a police shooting may change as an investigation progresses. There may also be other inconsistencies between the data sets.
We need to merge the `shootings-append.csv` data into a combined file, `shootings-merge.csv`. You can use your GitHub username or initials to identify your attempt at merging the data. For example, Matt Li (a fictional person) might name his solution `shootings-merge-ml.csv`.
Feel free to use your own tools to merge the data. One contributor is using Stata, but you can also use Python or R. We hope that through multiple parallel attempts at merging the data, we can catch each other's mistakes.
To quote @rajivsinclair: "We are all about redundancy. We can't have enough of that. Do it your own way, look at our start, look at our finish, and compare the results. Help us identify the flaws in our approach."
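One way to compare two contributors' finished merges is an outer join keyed on the complaint ID. A minimal sketch with toy data, assuming a shared key column (I'm calling it `CRID` here, which may not match the real schema) and a shared `STATUS` column:

```python
import pandas as pd

# Toy stand-ins for two contributors' independent merge attempts.
merge_ml = pd.DataFrame({"CRID": [1001, 1002, 1003],
                         "STATUS": ["Open", "Closed", "Open"]})
merge_xy = pd.DataFrame({"CRID": [1001, 1002, 1004],
                         "STATUS": ["Open", "Open", "Closed"]})

# Outer-join on the complaint ID; the indicator column records which
# attempt each row came from.
diff = merge_ml.merge(merge_xy, on="CRID", how="outer",
                      suffixes=("_ml", "_xy"), indicator=True)

# Rows present in only one attempt, or with conflicting values,
# are candidates for manual review.
only_one = diff[diff["_merge"] != "both"]
conflicts = diff[(diff["_merge"] == "both") &
                 (diff["STATUS_ml"] != diff["STATUS_xy"])]
print(only_one[["CRID", "_merge"]])
print(conflicts[["CRID", "STATUS_ml", "STATUS_xy"]])
```

Running something like this over each pair of submitted `shootings-merge-*.csv` files would surface exactly where our approaches disagree.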
We are going to work on documenting `shootings-append.csv` so the meaning of each column is clear. Please share questions about the data in this thread.
- To clarify: what exactly is meant here? Do we have two of (ostensibly) the same dataset with slightly conflicting information? Is there a reason why they are appended in one file rather than in two separate files?
- Is `csv` the best format for storing this data? Since there can be multiple people involved for the same complaint number, would a hierarchical format like `json` be better?
Hi @jayqi! Thanks for the excellent questions.
- These are NOT the exact same dataset. Many of the columns should (in theory) be the same from month to month. But because these files describe ongoing investigations, we expect some of the columns to change over time, such as CURRENT_CATEGORY and CURRENT_STATUS. Take a look at COLUMN-DICTIONARY.md, where we are currently working on documenting what each of the columns means.
My understanding of project goals (copy-pasting from the README): "Our first goal is to use the FOIA’ed shootings data here to identify officers per incident log number in the June 3 trove."
But you raise interesting questions: is it crucial to merge the FOIA dumps before we try to link officers with incident log numbers? If so, why?
Maybe we could skip the merging step if it lets us deliver value quicker. Curious to hear what you think @rajivsinclair @ithinkidunno.
If the May set is more recent, why not just use the May set? What information is contained in the April set that the May set is missing?
@jayqi Yes, great question. Let's dive into the data to find out. My understanding from @rajivsinclair and @ithinkidunno is that there may be some missing information in the May set that could be complemented by merging with the April set. But I would agree that it couldn't hurt to start with the May set and see what we find.
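If we do end up preferring the May rows but backfilling gaps from April, pandas' `combine_first` expresses that directly. A sketch with toy data; using `CRID` as the index key is an assumption, not the documented schema:

```python
import pandas as pd

# Toy stand-ins for the April and May FOIA sets, keyed by complaint ID.
april = pd.DataFrame({"CRID": [1, 2, 3],
                      "CURRENT_STATUS": ["Open", "Open", "Closed"]}).set_index("CRID")
may = pd.DataFrame({"CRID": [2, 3],
                    "CURRENT_STATUS": ["Closed", None]}).set_index("CRID")

# Prefer May's values; fall back to April wherever May is missing a
# whole row (CRID 1) or an individual cell (CRID 3's status).
combined = may.combine_first(april)
print(combined)
```

This "May first, April as backup" rule is only one possible policy; if a status can legitimately revert, we'd want something more careful.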
Same with `json`. Your `json` suggestion sounds very sensible to me. I am curious what @rajivsinclair and @ithinkidunno think.
`json` would make a lot of sense for columns that we expect to change over time, too:
```
// Incident
{
  "involved officers": [ ... ],
  "accused officer": { ... },
  "categories": [
    {
      "date": ...,
      "category_code": ...,
      "category_text": ...
    },
    {
      "date": ...,
      "category_code": ...,
      "category_text": ...
    }
  ]
}
```
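Folding the flat CSV rows into that nested shape is a `groupby` away. A sketch with made-up rows; the column names (`CRID`, `date`, `category_code`, `category_text`) are illustrative, not the confirmed schema:

```python
import pandas as pd

# Toy flat rows: one row per (incident, category-change) pair.
rows = pd.DataFrame({
    "CRID": [1055981, 1055981],
    "date": ["2016-04-01", "2016-05-01"],
    "category_code": ["18A", "20A"],
    "category_text": ["Firearm discharge", "Taser discharge"],
})

# Fold each incident's category history into a list of dicts,
# matching the nested JSON layout above.
incidents = [
    {"CRID": crid,
     "categories": grp[["date", "category_code", "category_text"]].to_dict("records")}
    for crid, grp in rows.groupby("CRID")
]
print(incidents[0]["categories"])
```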
The May and April files are different. This is why I appended them rather than merging them: the April one has a bigger universe of complaints, whereas the May one only had the shootings.
OK, this is useful @ithinkidunno.
Strikes me that a couple of simple counts would be useful:
- Number of unique incident IDs in the June 3 dump
- ditto for the May data
- ditto for the April data
- ditto for the Feb data

... and how much overlap exists between the incident IDs in each pair of data sets.
That could let us answer questions like:
- "Can the April data set alone let us link most of the June 3 incidents to CPD officers?"
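Once the incident-ID columns are loaded (e.g. something like `set(df["CRID"].dropna())` for each file, if `CRID` turns out to be the key), those counts reduce to set arithmetic. Toy IDs below:

```python
# Toy incident-ID sets standing in for the four FOIA dumps.
june = {"C1", "C2", "C3", "C4"}
may = {"C2", "C3"}
april = {"C1", "C2", "C3"}
feb = {"C3"}

# For each earlier dump: how many unique IDs it has, and how many of
# the June 3 incidents it could help us link.
for name, ids in [("May", may), ("April", april), ("Feb", feb)]:
    overlap = len(june & ids)
    print(f"{name}: {len(ids)} unique IDs, {overlap}/{len(june)} overlap with June 3")
```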
These don't seem to have definitions, which is odd since these columns should have the most relevance for the merging. Don't we need more context (from COLUMN-DICTIONARY.md)?
edit: this is issue #9
@freddymartinez9 Yup, we definitely need to provide more context here. The column documentation should be complete (at least a first draft) by tonight.
@jayqi Are you coming to ChiHackNight tonight? Loved the ideas you shared here, would be excellent to meet up and work together on this in person. Let me know. If you want to chat, my email is in my profile.
I have another version of this that I can submit a PR for if you want, although we might want to wait until we have a consensus on how we're treating May. You can take a look at the columns for april and may that I combined / renamed here, though, in the section marked "Cleaning / Combining April and May":
https://github.com/DGalt/shootings-data/blob/dev/april_may_munging.ipynb
@DGalt Another version of `shootings-append.csv`? I think a PR for that would be very welcome. Did you get a chance to compare it with the original `shootings-append.csv`? Any major differences?
I noticed this line in the iPython notebook:

```python
sht_apd = pd.read_csv('./Raw/shootings-append.csv', low_memory=False)
```

... which seems to be reading @ithinkidunno's append for a comparison attempt, but I didn't see the `sht_apd` variable later on. I may also be misreading this!
@DGalt The Exploring Data section is extremely interesting.
Couple of questions:
- I wonder how the results line up against https://cpdb.co/data/bP8dwb/citizens-police-data-project?
- In the last table, what should we make of `nan`?
- The last table seems to be complaints with the most involved officers, is that correct? Creative way to slice the data, I would never have thought of that. I'm curious about those incidents: 1055981, 1070415, 1055068...
You're right, I load @ithinkidunno's shootings-append but don't use it again. This is a cleaned-up version of an earlier notebook where I was comparing the two, but I didn't include the comparison here. One major difference is that @ithinkidunno's file only includes the subset of the April data set that corresponds to shootings and tasings (the events where the incident code begins with either "18", "20", or "S"). I do not do that here, as it's not yet clear whether we're focusing on those incidents or all possible incidents. It would be easy enough, though, to filter those particular matches out of my file and compare the two different versions.
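Filtering down to that shooting/taser subset should be a one-liner, assuming the code lives in a single column (the name `incident_code` below is a placeholder):

```python
import pandas as pd

# Toy combined file; the real one would be the merged April/May data.
df = pd.DataFrame({"incident_code": ["18A", "20B", "S-123", "05C", "45A"]})

# Keep only codes beginning with "18", "20", or "S"
# (shootings and tasings, per @ithinkidunno's subset).
mask = df["incident_code"].str.startswith(("18", "20", "S"))
shootings = df[mask]
print(shootings)
```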
As for your other questions:
- I haven't looked at the cpdp dataset, but it would definitely be interesting to compare the two.
- The `nan` values correspond to a `nan` in the `accused_officer_fname` or `accused_officer_lname` column. I should probably re-do this so that if there is `nan` in one of those columns but not the other, that's actually represented in the data.
- Yes, that is correct. I can provide the full dataset (for both this and for sorting by the number of incidents each officer is involved in). If particular incidents look interesting, it's easy enough to filter them out and look at them specifically from the combined dataset.
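One way to make the partial-name case explicit: keep the raw name columns and derive a status flag instead of collapsing everything to a single `nan`. The column names are the ones from the notebook; the `name_status` flag is made up for this sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "accused_officer_fname": ["John", None, "Mary", None],
    "accused_officer_lname": ["Doe", "Smith", None, None],
})

fname_missing = df["accused_officer_fname"].isna()
lname_missing = df["accused_officer_lname"].isna()

# Hypothetical flag distinguishing fully-known, partially-known,
# and fully-unknown officer names.
df["name_status"] = "complete"
df.loc[fname_missing ^ lname_missing, "name_status"] = "partial"
df.loc[fname_missing & lname_missing, "name_status"] = "missing"
print(df["name_status"].tolist())
```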
Merging in another related issue:

Right now, `shootings-append.csv` includes only data from the April and May files. The February file includes only three types of incident categories: 18A, 18B, 20A. It uses a very different column schema from the April and May files, and it has less information than the other two files.

We would like to merge it into `shootings-append.csv`, but we need to figure out how to accurately map the column schemas onto each other, given that February has less information.
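Once the column correspondence is worked out, the mechanical part is a rename plus a concat; columns February lacks just come through as missing. The February column names below are placeholders, not the real schema:

```python
import pandas as pd

# Placeholder mapping from hypothetical February column names to the
# April/May schema being documented in COLUMN-DICTIONARY.md.
feb_to_append = {
    "COMPLAINT_NO": "CRID",
    "CATEGORY": "CURRENT_CATEGORY",
}

feb = pd.DataFrame({"COMPLAINT_NO": [9001], "CATEGORY": ["18A"]})
append = pd.DataFrame({"CRID": [9002], "CURRENT_CATEGORY": ["20A"],
                       "CURRENT_STATUS": ["Open"]})

# Rename February's columns, then stack the two frames; any column
# February doesn't have becomes NaN in its rows.
merged = pd.concat([append, feb.rename(columns=feb_to_append)],
                   ignore_index=True, sort=False)
print(merged)
```

The hard part remains deciding, column by column, whether a February field really means the same thing as its April/May counterpart.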