Data-hotfixing without master branch commits
Opened this issue · 7 comments
Not sure what this is about.
When the model runs in production, it often fails for some states because of artifacts in the data (#8).
The approach so far is fixing these data points in the code:
https://github.com/rtcovidlive/covid-model/blob/master/covid/data.py#L27:L81
This hotfixing makes master-commits that bypass a code-review process and are one reason why the latest master runs in production.
To uncouple production from latest master (#14) we should find a way to do the manual outlier correction without changing the code on the master branch all the time.
One option would be to read the corrections from a GitHub Gist CSV file or Google Sheet.
I like the gist CSV idea; something like
region | start date | end date | override_value
for the headers?
I like the gist CSV idea; something like
region | start date | end date | override_value
for the headers?
Looks like edit permissions on gists can't be shared.
A wiki page?
Including the date of making the correction would be good for realistic backtesting.
I'm going to have to think about this a little because sometimes the corrections are simple as a line of code, but not as simple if left to a CSV. For instance:
data.loc[idx["LA", pd.Timestamp("2020-06-19") :], :] += 1666
How would one do that in a CSV? (add 1666 to every value after 2020-06-19?) We can, of course, create our own DSL but that seems like overkill too.
I'm wondering if we're trying to fix something we don't actually need to fix. I think we can sort out the "which version runs in production" by simply tagging a production release. Hotfixes happen and I don't know that they're necessarily "bad". Also, if we move this to a CSV, it can become quite brittle because we have no history of the changes to the CSV and if anyone else is using this model then their results can change without warning which seems problematic.
An additional reason to keep in code is that I think changes to the data going forward may be algorithmic rather than scalar. For instance, I posted the following in the slack channel today:
def adjust_totals(model_data, threshold=10):
excess = model_data.total-model_data.positive
adjusted = pd.Series(0, index=model_data.index)
zero_days = []
for idx, row in model_data.iterrows():
if excess[idx] <= threshold:
zero_days.append(idx)
elif len(zero_days):
share = excess[idx]/(len(zero_days)+1)
for day in zero_days:
adjusted[day] = share
adjusted[idx] = share
zero_days=[]
else:
adjusted[idx] = row['total']
return adjusted
There's no way changes like this will fit into a CSV :)
I'd recommend making data changes transparent, in code, and as part of the git history for these reasons. Additionally, we may actually want a per-region sanitizer file (eg. US_OK.py, US_MT.py) where we can more clearly/cleanly show exactly what we're doing to the data of each state.
Hi! I'd like to contribute to this issue if possible!
I'd recommend making data changes transparent, in code, and as part of the git history for these reasons. Additionally, we may actually want a per-region sanitizer file (eg. US_OK.py, US_MT.py) where we can more clearly/cleanly show exactly what we're doing to the data of each state.
Referring to this, perhaps you have one file that contains separate cleaning functions (one per state). In addition we could use a dictionary mapping state identifiers (e.g. "US_MI") to the appropriate cleaning function, and have an overarching function that iteratively calls the clean function for each state. In data_US.py
we could call the top function.
I don't think creating a per-region file is bad (rather I prefer each region knowing how to clean itself), but feel that it could cause clutter, even if we were to put those region-specific files in their own submodule!
EDIT: Perhaps something like this, and we call all of that in data_US.py
like this?
Not to double post, but noticing that some PRs mentioned adding data_XX.py
for other countries (Israel & Germany), it might be a good idea to package all the data processing files into a submodule and then pre-register those countries' loaders in data.py
.
If we take this a bit further, and put all the data processing code in its own repository & python package: then we wouldn't need to update this repo every time we add more cleaning/processing steps. We could also a function to get the Loaders dict, and then copy key/value pairs here (allowing cloners of the repo to still add their own functions from third party modules if they choose to). One issue that I see with this approach is that cloners have less visibility over the data processing steps and would have to go to another repo to see what's being done to the data (our documentation would have to be clear about how to find this information). On the other side, people who are building their own models but want to use our data cleaning steps can now import that package alone.