How to test the correctness of the data?
adborden opened this issue · 7 comments
Conversation about our testing strategy. How do we know the data is correct? We want to know when the data is inaccurate. How might we create assurances that our calculations are correct?
In some cases, there is an error in the filings, but that is the data we have. Is there anything we can do for this case?
Tiered approach
There are several steps in our ETL: downloading, data cleaning, import, helper views, and calculations for different entities. We could implement assertions at each level (tier) of the ETL process to catch errors early.
I'm not sure we have a sense of which level our errors are coming from. I suspect it's the last step, the calculations, which is the bulk of the process. I wonder if we can split it into smaller chunks that are simpler and easier to make assertions against.
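To make the tiered idea concrete, here is a minimal sketch in Python of a per-tier assertion helper. Everything in it is an assumption for illustration: the `check_tier` function, the tier name, and the `filer_id` and `amount` fields are hypothetical and not taken from the OpenDisclosure code.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Check = Callable[[Row], bool]


def check_tier(tier: str, rows: Iterable[Row], checks: Dict[str, Check]) -> List[Row]:
    """Run every named check on every row; raise if any check fails.

    The error message names the tier, so a failure points at the ETL step
    that produced the bad row rather than at the final calculation.
    """
    rows = list(rows)
    failures = [
        f"[{tier}] {label}: {row!r}"
        for row in rows
        for label, check in checks.items()
        if not check(row)
    ]
    if failures:
        raise AssertionError("\n".join(failures))
    return rows


# Hypothetical checks for a "cleaning" tier; real checks would depend on the schema.
CLEANING_CHECKS: Dict[str, Check] = {
    "filer_id is present": lambda r: bool(r.get("filer_id")),
    "amount is numeric": lambda r: isinstance(r.get("amount"), (int, float)),
}

# Rows that pass are returned unchanged, so the helper can sit between tiers.
cleaned = check_tier(
    "cleaning",
    [{"filer_id": "1364564", "amount": 250.0}],
    CLEANING_CHECKS,
)
```

Because each tier would get its own small set of checks, a failure is reported at the earliest step where the data stops looking right.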
Sentinels
We keep expected data/calculations and compare them to the actual data/calculations. This is similar to what we were doing manually: diffing the build output and looking for unexpected changes. This was sometimes hard because the numbers change whenever the data updates, so it's hard to tell whether a mismatch comes from the data or from a change in the calculation.
One problem with sentinel checks is that they don't tell you what the error is, only that there is an error.
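A sentinel check could look roughly like the sketch below. The function name, the dictionary keys, and the tolerance are assumptions for illustration. It reports which values differ, which narrows down where the mismatch is, though it still cannot say why.

```python
from typing import Dict, Optional, Tuple


def diff_against_sentinels(
    expected: Dict[str, float],
    actual: Dict[str, float],
    tolerance: float = 0.005,
) -> Dict[str, Tuple[float, Optional[float]]]:
    """Return {key: (expected, actual)} for every sentinel that does not match.

    A sentinel missing from `actual` is reported with None as the actual value.
    Keys that appear only in `actual` are ignored.
    """
    mismatches: Dict[str, Tuple[float, Optional[float]]] = {}
    for key, want in expected.items():
        got = actual.get(key)
        if got is None or abs(got - want) > tolerance:
            mismatches[key] = (want, got)
    return mismatches


# Hypothetical sentinel values, pinned to a fixed snapshot of the data so that
# a mismatch can only come from a change in the calculation.
expected = {"candidate-a-contributions": 100.0, "measure-jj-supporting": 50.0}
actual = {"candidate-a-contributions": 100.0, "measure-jj-supporting": 55.0}

mismatches = diff_against_sentinels(expected, actual)
```

Pinning the sentinels to a frozen input snapshot, rather than to live data, is one way to separate data changes from calculation changes.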
Thanks @adborden for starting this conversation.
One place we could create an automated comparison: sum the totals for each filing period and compare the result against the year-to-date total on the most recent filing plus any subsequent 24-hour filings.
One source of differences may be that OpenDisclosure sums the total contributions and total expenditures for each filing period. If a committee edits prior transactions and then doesn't amend the prior filings, those totals won't be updated in the data for the earlier periods, but the edit will affect the overall total reported on subsequent filings.
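The comparison described above could be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes the summed figure already counts the 24-hour filings made after the most recent regular filing. The amounts are invented to show the unamended-edit case.

```python
from decimal import Decimal
from typing import Iterable


def reconcile_totals(
    period_totals: Iterable[Decimal],
    ytd_total: Decimal,
    later_24h_filings: Iterable[Decimal],
) -> Decimal:
    """Return (sum of per-period totals) minus (YTD total + later 24-hour filings).

    Zero means the two figures agree. A non-zero result suggests that prior
    transactions were edited without amending the earlier filings, or that
    the calculation itself is off.
    """
    summed = sum(period_totals, Decimal("0"))
    reported = ytd_total + sum(later_24h_filings, Decimal("0"))
    return summed - reported


# Invented example: two regular periods plus one 24-hour filing of 300.00.
# The committee later removed a 100.00 transaction from the first period
# without amending that filing, so the year-to-date total is 3400.00
# rather than 3500.00.
difference = reconcile_totals(
    period_totals=[Decimal("1000.00"), Decimal("2500.00"), Decimal("300.00")],
    ytd_total=Decimal("3400.00"),
    later_24h_filings=[Decimal("300.00")],
)
```

`Decimal` is used instead of `float` so that dollar amounts compare exactly.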
On filer error, we do have a disclaimer in our FAQ. Some problems, though, aren't genuine filer errors but difficulties caused by the way the data is reported. I think a very generic statement is better, with a link to our note stating we don't clean the data.
Agreed, there are a lot of different things we can/should be testing here.
The first tests, which I just merged yesterday in #127, seek to ensure that our calculations match our mental models of what is happening. We do this by creating trivial, static test cases that are much easier to reason about. This is purely to test that our code does what we think it does.
I'd like to propose that we create a distinction between "tests" and filer QA scripts that can detect filer error.
For the "tests" that we write, the goal should be that for any individual change to OpenDisclosure, all of the tests are passing, which indicates that there are no unexpected changes in behavior.
For filer QA, we can do whatever we want. If there are particular things we want to check and report on somehow, sure, let's do that. But it's not in our power to fix QA issues, so I'd love to separate them conceptually from "tests".
Right now, IMHO, our priority should be to add more test cases, especially as we hit buggy times of the filing season (i.e. when 496/497 data needs to be de-duplicated). When we get a good set of tests that make sure that the numbers we're calculating for ballot measures and candidates are right, then that should free up Suzanne from having to manually check all the numbers all the time. That'd be nice, wouldn't it? 😎
QA wishlist:
- List of Ballot Measure Committees (and which ballot measures they contribute to) that aren't on the spreadsheet
- Which Filer IDs are re-used multiple times - i.e. 1364564 is both a recipient committee ("Lift Up Oakland for better wages") and a Ballot Measure committee ("Committee to Protect Oakland Renters - Yes on Measure JJ")
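The second wishlist item could be a small QA script along these lines. This is only a sketch: the `reused_filer_ids` function and the `(filer_id, committee_name)` input shape are assumptions, and the third row is a made-up entry added for contrast.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def reused_filer_ids(committees: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each Filer ID that appears under more than one committee name
    to the sorted list of those names."""
    names_by_id: Dict[str, set] = defaultdict(set)
    for filer_id, name in committees:
        names_by_id[filer_id].add(name)
    return {
        filer_id: sorted(names)
        for filer_id, names in names_by_id.items()
        if len(names) > 1
    }


committees = [
    ("1364564", "Lift Up Oakland for better wages"),
    ("1364564", "Committee to Protect Oakland Renters - Yes on Measure JJ"),
    ("0000001", "Hypothetical Committee"),  # made-up row, used once
]

reused = reused_filer_ids(committees)
```

A report like this only flags the re-use. Deciding which committee a given transaction belongs to would still need a human or additional data.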