Nonprofit-Open-Data-Collective/irs-efile-master-concordance-file

Needs to be merged


A distinct class of variable problem we see is variables that represent the same thing but have been given different names, and thus need to be merged. When given the option, try to pick the variable_name used in the more recent version.

Hey @lecy, here's an example of something that I think needs to be merged. It's the 'opposite' of the problem I was looking at with the scope_duplicates, which typically require variable 'splitting'.

(I'm omitting the /Return/ReturnData/ below for brevity)

/IRS990ScheduleR/IdRelatedOrgTxblCorpTrGrp/USAddress/ZIPCd is present in versions 2014v5.0;2014v6.0;2015v2.0;2015v2.1;2015v3.0;2016v3.0, and /IRS990ScheduleR/IdRelatedOrgTxblCorpTrGrp/USAddress/ZIPCode is in versions 2013v3.0;2013v3.1;2013v4.0.

But I see 'SR_04_PC_IROCTAZIP' for /IRS990ScheduleR/IdRelatedOrgTxblCorpTrGrp/USAddress/ZIPCd

and 'SR_04_PC_IROCTAZIPCOD' for /IRS990ScheduleR/IdRelatedOrgTxblCorpTrGrp/USAddress/ZIPCode

I think these should be the same thing. But note that the variable_name has 'COD' appended in the second case above. So the merge I'm suggesting would preserve SR_04_PC_IROCTAZIP and remove SR_04_PC_IROCTAZIPCOD, since the former is the more recent one.
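
A minimal sketch of how one might detect these merge candidates automatically, assuming the concordance is a CSV named efiler_master_concordance.csv with xpath and variable_name columns (both the filename and the column names are assumptions here), and using an alias table that lists only the one leaf rename from this issue:

```python
import csv
from collections import defaultdict

# Leaf-element renames across schema years that should resolve to one variable.
# Only the example from this issue is listed; a real alias table would be longer.
LEAF_ALIASES = {"ZIPCode": "ZIPCd"}

def normalize(xpath):
    """Collapse known leaf renames so equivalent xpaths group together."""
    parts = xpath.split("/")
    parts[-1] = LEAF_ALIASES.get(parts[-1], parts[-1])
    return "/".join(parts)

# Group variable_names by normalized xpath; more than one name per group
# is a merge candidate like the SR_04_PC_IROCTAZIP* pair above.
names_by_xpath = defaultdict(set)
with open("efiler_master_concordance.csv", newline="") as f:
    for row in csv.DictReader(f):
        names_by_xpath[normalize(row["xpath"])].add(row["variable_name"])

for xpath, names in sorted(names_by_xpath.items()):
    if len(names) > 1:
        print(xpath, sorted(names))
```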

@lecy @borenstein @MiguelABarbosa Here's a more complete list, as an Excel file, of the variables that I think should be merged. The change to the concordance would touch only the variable_name field (changing variable_name_original to merge_into_variable_name); the xpath and versions would remain unchanged.

In general I'm replacing the older variable with the newer one. I only ran this to 'fix' things from the 2013 schemas forward; there may be similar errors in earlier schemas that these proposed merges don't touch.
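
And a rough sketch of what applying such a merge file would look like. The proposed_merges.csv name and its two columns just mirror the variable_name_original / merge_into_variable_name description above, and the concordance filename and column layout are again assumptions:

```python
import csv

# Build the rename map from a two-column merge file:
# variable_name_original -> merge_into_variable_name, as described above.
with open("proposed_merges.csv", newline="") as f:
    merge_map = {row["variable_name_original"]: row["merge_into_variable_name"]
                 for row in csv.DictReader(f)}

with open("efiler_master_concordance.csv", newline="") as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames
    rows = list(reader)

# Only variable_name is rewritten; xpath, versions, and all other fields pass through.
for row in rows:
    row["variable_name"] = merge_map.get(row["variable_name"], row["variable_name"])

with open("efiler_master_concordance_merged.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```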

lecy commented

@jsfenfen this is great progress. Thanks for pushing forward with these cases. I have been slammed this week so sorry for the slow response.

I agree with your strategies here for splitting and merging variables.

I am sharing the cases we identified at the validatathon. They are under the tab "Need Love". Here is the form participants completed:

https://goo.gl/forms/UpYGvwsJ3QCaNeno1

And the resulting spreadsheet (attached).

Validatathon Response Form (Responses).xlsx

I am not sure if there is a good way to cross-check these two lists, but just wanted to make sure you have this info for reference.

Yeah, not sure of the best way forward on this. Here's the diff of the merges I turned up by script, fwiw: jsfenfen@dc066b3

Also, @lecy, just eyeballing it, I don't think there's much overlap between the validatathon responses and my suggested merges, which are mostly addresses / ZIP codes / states. If you're sure the file is correct, I could probably pull out the specific rows that call for merges and turn them into a similar 'merge file', if that makes sense.
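
If we did go that route, pulling the merge rows out of the spreadsheet might look like the sketch below; every column name in it is a guess at the response-form headers, since I haven't mapped the actual layout of the attached file:

```python
import pandas as pd

# Column names below ("Action", "Current variable_name", "Correct variable_name")
# are placeholders for whatever the response form actually calls these fields.
responses = pd.read_excel("Validatathon Response Form (Responses).xlsx")

# Keep only the rows whose requested action mentions a merge.
merges = responses[responses["Action"].str.contains("merge", case=False, na=False)]

# Reshape into the same two-column merge file format as the scripted proposals.
merges.rename(columns={
    "Current variable_name": "variable_name_original",
    "Correct variable_name": "merge_into_variable_name",
})[["variable_name_original", "merge_into_variable_name"]].to_csv(
    "validatathon_merges.csv", index=False)
```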

There may be some overlap with my variable-splitting commits? Some of them dealt with issues that were somewhat orthogonal to this but still required fixing. I could be wrong too.

Yeah, so for instance F9_12_PC_FINSTATBOTH is dealt with here: jsfenfen@c30bdda, though it also appears in the validatathon report (for the same issue I fixed it for, I think).

lecy commented

I think the way forward is to continue with your current strategy of automated checks: finding ways to check the logic of variables using metadata and rules about what should be true if the concordance is correct.
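
One concrete example of such a rule, as a minimal sketch: within a single schema version, an xpath should map to exactly one variable_name. This assumes a semicolon-delimited version column like the lists quoted earlier in this thread; the real filename and column names may differ.

```python
import csv
from collections import defaultdict

# Rule: within one schema version, an xpath should carry exactly one variable_name.
# Assumes a semicolon-delimited "version" column, as in the version lists above.
names = defaultdict(set)
with open("efiler_master_concordance.csv", newline="") as f:
    for row in csv.DictReader(f):
        for version in row["version"].split(";"):
            names[(version, row["xpath"])].add(row["variable_name"])

for (version, xpath), variable_names in sorted(names.items()):
    if len(variable_names) > 1:
        print(version, xpath, sorted(variable_names))
```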

When we have fixed as many issues as we can, we start with the divide-and-conquer strategy: one human reviews one Part of the 990 form, confirms the variables are consistent according to a set of provided rules, previews some of the data (David has generated a dataset and report for each xpath to facilitate this), and signs off that the entire Part has been validated.

I will have bandwidth in March to focus on the main parts from the 990-PC (I have no bandwidth right now). I would like to have a dataset for a conference in April.

I am sure we will continue to find problems moving forward, but as long as the general architecture of the concordance is solid and we have a clear protocol for identifying bugs, they should iron out over time. Just keep track of which parts and schedules have not been validated by a human, and provide the necessary caveats.

Does that sound feasible as a longer-term strategy?

@lecy that sounds reasonable. I put the merges in a pull request; let me know if there's any reason not to merge them, and I'll do so in a day or two if no one finds anything. I guess my question is about the quality of the validatathon spreadsheet. If we believe that all the rows that say they need to be merged should get merged as directed, I can run that too.