Nonprofit-Open-Data-Collective/irs-efile-master-concordance-file

Clear definition of 'canonical' version

Closed this issue · 10 comments

Folks have generally said that version 2016v3.0 is the 'canonical' version, but that's obviously not the case for variables that don't exist in 2016v3.0. (It sorta looks like a 2015 version to me, in part because the new header stuff is missing, but I dunno).

For a variable that existed from 2010-2012, what would the canonical version be? Would it be the most recent version or the first one it appeared in? Has that been delineated somewhere? (Apologies if it has.)

If I knew that, I could attach missing location codes, but without knowing which one to use I think you'd have to eyeball it? This doesn't matter 99 of 100 times, but it's annoying to get wrong.

Thanks @borenstein! Do you have a link to that? You mean as a way to fix the location codes? I think that would make sense.

That said, I'm skeptical of usage as an algorithmic "deciding factor" because it changes, right [I think that's the sense in which you mean attested]? Moreover, it's not explicit--I have to know all sorts of external stuff rather than picking something that's available with just the data in front of me.

My vote's going to be for the last version that a variable appears in. I end up grouping discontinued variables by vintage, in part because that's when I notice them most. But also, the test for variable "currency" is then just whether that last version equals the current version.
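A minimal sketch of that rule, assuming the concordance can be reduced to (variable, version) pairs and that version strings look like `2016v3.0`. The variable names and the sample rows are made up for illustration, and `version_key`, `canonical_versions`, and `is_current` are hypothetical helpers, not anything in the repo.

```python
import re

def version_key(version):
    """Turn a version string like '2016v3.0' into a sortable tuple (2016, 3, 0)."""
    match = re.fullmatch(r"(\d{4})v(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    year, major, minor = match.groups()
    return int(year), int(major), int(minor)

def canonical_versions(rows):
    """Map each variable to the last (highest-sorting) version it appears in."""
    latest = {}
    for variable, version in rows:
        if variable not in latest or version_key(version) > version_key(latest[variable]):
            latest[variable] = version
    return latest

def is_current(variable, latest, current_version):
    """A variable is 'current' if its canonical version is the current version."""
    return latest[variable] == current_version

# Illustrative rows only; real rows would come from the concordance file.
rows = [
    ("OLD_VAR", "2010v3.2"),
    ("OLD_VAR", "2012v2.1"),
    ("LIVE_VAR", "2012v2.1"),
    ("LIVE_VAR", "2016v3.0"),
]
latest = canonical_versions(rows)
```

The nice part is that this needs nothing beyond the data in front of you: a discontinued variable gets its last vintage as canonical, and anything whose canonical version is not the current one is by definition discontinued.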

lecy commented

Jacob, "the last version that a variable appears in" sounds right.

The location code is a variable-level (as opposed to xpath-level) attribute to include in the data dictionary so the user can look up the field on the 990 form if necessary.

If you are referencing old forms, just note that somehow.

I think that's clear enough, thanks @lecy

lecy commented

If that is valuable info, we could just add the schema_location attribute as a separate column? Jacob has a nice way to extract these from the schemas.

| xpath | line_number |
| --- | --- |
| /IRS990/AccountantCompileOrReview | [AccountantCompileOrReview] Part XI Line 2a |
| /IRS990/AccountantCompileOrReview | [AccountantCompileOrReview] Part XII Line 2a |
| /IRS990/AccountantCompileOrReviewBasis/FinancialStatementBoth | [AccountantCompileOrReviewBasis] Part XII Line 2a; [FinancialStatementBoth] Part XII Lines 2a and 2b |
| /IRS990/AccountantCompileOrReviewBasis/FinancialStatementConsolidated | [AccountantCompileOrReviewBasis] Part XII Line 2a; [FinancialStatementConsolidated] Part XII Lines 2a and 2b |
| /IRS990/AccountantCompileOrReviewBasis/FinancialStatementSeparate | [AccountantCompileOrReviewBasis] Part XII Line 2a; [FinancialStatementSeparate] Part XII Lines 2a and 2b |
| /IRS990/AccountantCompileOrReviewInd | [AccountantCompileOrReviewInd] Part XII Line 2a |
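If the line_number values always follow the `[Element] location; [Element] location` shape seen in the sample rows above, a separate column could be derived roughly like this. It is only a sketch under that assumption; `parse_line_number` and `leaf_location` are hypothetical helper names, and the real extraction from the schemas may work differently.

```python
import re

def parse_line_number(field):
    """Split '[Elem] location; [Elem2] location2' into {element: location}."""
    locations = {}
    for part in field.split(";"):
        match = re.fullmatch(r"\s*\[([^\]]+)\]\s*(.+?)\s*", part)
        if match:  # skip fragments that do not fit the assumed shape
            locations[match.group(1)] = match.group(2)
    return locations

def leaf_location(xpath, field):
    """Return the location listed for the last element of the xpath, or None."""
    leaf = xpath.rstrip("/").split("/")[-1]
    return parse_line_number(field).get(leaf)

# One of the sample rows from this thread.
xpath = "/IRS990/AccountantCompileOrReviewBasis/FinancialStatementBoth"
field = ("[AccountantCompileOrReviewBasis] Part XII Line 2a; "
         "[FinancialStatementBoth] Part XII Lines 2a and 2b")
location = leaf_location(xpath, field)
```

Keying on the leaf element keeps the parent's location (here Part XII Line 2a) from being mistaken for the field's own location.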

Hey @borenstein, the logic will be (am travelling till next week); the source files are line_numbers.csv and descriptions.csv here: https://github.com/jsfenfen/shared_irs_docs