Model needs to correct for anomalies in testing reporting
Increasingly, states are reporting 100% positive tests on a given day (e.g., 215 of 215 tests came back positive). This throws the model off because it assumes the positive rate of tests is roughly proportional to the actual number of tests. If a state reports 100% positive tests, Rt increases too quickly because of the faulty data point.
For instance, Ohio has a handful of days when total tests have clearly not been reported correctly and the positive % shoots up to 100%.
And in some cases, tests are withheld one day, only to be reported together with the next day's results.
In this case, a drop in the data is often followed by 2x the usual number of tests the following day.
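To make the two failure modes concrete, here is a minimal sketch of how such days could be flagged before the data reaches the model. The function name, the `max_positive_rate` threshold, and the rule of also flagging the day after a zero-report day are assumptions for illustration, not something the model currently does.

```python
def flag_reporting_anomalies(positives, totals, max_positive_rate=0.95):
    """Return the indices of days whose test counts look misreported.

    positives and totals are equal-length sequences of daily counts.
    max_positive_rate is an assumed cutoff, not a validated value.
    """
    flagged = set()
    for day, (pos, tot) in enumerate(zip(positives, totals)):
        if tot == 0:
            # Nothing reported: results were probably withheld, so the
            # next day's batch likely covers two days of testing.
            flagged.add(day)
            if day + 1 < len(totals):
                flagged.add(day + 1)
        elif pos / tot >= max_positive_rate:
            # Nearly every test came back positive, e.g. 215 of 215.
            flagged.add(day)
    return sorted(flagged)


# Made-up numbers: day 1 is 100% positive, day 3 is withheld,
# and day 4 carries roughly two days of tests.
positives = [20, 215, 18, 0, 45, 22]
totals = [200, 215, 190, 0, 410, 210]
suspicious_days = flag_reporting_anomalies(positives, totals)
```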
In either case, an unstable positive % confuses the model significantly, so we need to figure out a solution to either:
- Remove these anomalies and let the model infer the true hidden value
- Correct these anomalies using some kind of algorithm
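The two options above could look roughly like this. Both helpers are hypothetical: `mask_days` assumes the model can treat `None` as an unobserved day, and `spread_withheld` assumes a batch that follows a zero-report day really covers exactly two days, which will not always be true.

```python
def mask_days(values, flagged):
    """Option 1: replace flagged days with None so they count as unobserved."""
    flagged = set(flagged)
    return [None if i in flagged else v for i, v in enumerate(values)]


def spread_withheld(values):
    """Option 2: split a batch that follows a zero-report day across both days."""
    fixed = list(values)
    for day in range(len(fixed) - 1):
        if fixed[day] == 0 and fixed[day + 1] > 0:
            # Assume the batch covers two days and divide it evenly.
            half = fixed[day + 1] / 2
            fixed[day], fixed[day + 1] = half, half
    return fixed


# Made-up daily test totals.
masked = mask_days([200, 215, 190], flagged=[1])
spread = spread_withheld([190, 0, 410, 210])
```

Masking keeps the daily totals honest at the cost of fewer observations, while spreading preserves the overall count but invents a per-day split that was never reported.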
Currently @tvladeck and I have looked at Gaussian Processes and Kalman Filters as ways of detecting and perhaps correcting these issues. Other ideas are welcome too.
I ran into the same issue with 0 followed by 2x cases when processing data from ECDC. After checking out both Gaussian Processes and Kalman Filters, I settled on using a Hampel filter: https://nbviewer.jupyter.org/github/gkossakowski/covid-19/blob/master/Realtime%20Rt%20mcmc.ipynb#Hampel-filter-for-all-countries
It tends to catch ~90% of anomalies in reported case numbers. For the remaining 10% I haven't found anything better than fixing up the data manually. It happens rarely enough that it's not a big issue. The advantage of the Hampel filter is that its behaviour is straightforward to understand.
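For anyone who hasn't used one, a Hampel filter compares each point with the median of a sliding window and replaces it with that median when it deviates by more than a few scaled median absolute deviations (MAD). Below is a generic sketch of the idea, not the code from the linked notebook; `half_window` and `n_sigmas` are assumed defaults that would need tuning per dataset.

```python
from statistics import median


def hampel_filter(values, half_window=3, n_sigmas=3.0):
    """Replace outliers with the local median.

    Returns (filtered_values, outlier_indices). Windows are truncated
    at the edges of the series rather than padded.
    """
    scale = 1.4826  # makes the MAD comparable to a Gaussian standard deviation
    filtered = list(values)
    outliers = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        med = median(window)
        mad = median(abs(v - med) for v in window)
        if abs(values[i] - med) > n_sigmas * scale * mad:
            filtered[i] = med
            outliers.append(i)
    return filtered, outliers


# Made-up case counts with a zero day followed by a doubled day.
cases = [100, 104, 98, 0, 205, 101, 99, 103, 97]
filtered, outliers = hampel_filter(cases)
```

Note that when the MAD of a window is zero (a perfectly flat stretch), any deviation at all gets flagged, so very quiet series may need a minimum threshold on top of this.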