Priesemann-Group/covid19_inference

Clean the data_retrieval module

Closed this issue · 7 comments

Make the data_retrieval module consistent:
- data is retrieved with get_* with * being something like rki or jhu
- it is then filtered with different functions to get the data that one wants. These functions should be named filter_* and should all accept two datetime.datetime objects as the begin and end of the data one wants. They should return a pandas DataFrame with datetime.datetime objects as the index and the different regions/countries as the columns (a sketch is shown below).
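
To make the intended interface concrete, here is a minimal sketch of the filter_* signature and the return shape; the function name filter_region and the numbers are made up for illustration, not taken from the module:

```python
# Minimal sketch of the proposed filter_* convention; filter_region and the
# synthetic data below are hypothetical, for illustration only.
import datetime

import pandas as pd


def filter_region(df, regions, date_begin, date_end):
    """Slice df to [date_begin, date_end] and the requested region columns."""
    return df.loc[date_begin:date_end, regions]


# Expected return shape: a datetime index, one column per region/country.
index = pd.date_range("2020-03-01", periods=5, freq="D")
cases = pd.DataFrame(
    {"Germany": [100, 130, 170, 220, 280], "Italy": [400, 480, 570, 680, 800]},
    index=index,
)
print(
    filter_region(
        cases,
        ["Germany", "Italy"],
        datetime.datetime(2020, 3, 2),
        datetime.datetime(2020, 3, 4),
    )
)
```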

So, for example, for "filter_one_country()" we want the output to be the described pandas DataFrame instead of an array?

I was thinking about rewriting the module to make it a bit less cluttered.

Create a class for each source (e.g. Johns Hopkins University).

  • Loads the data from the online source at initialization and stores it as an attribute. (The data size is quite small, so that should be fine for now.)
  • Methods to get the filtered data in the different desired formats.

That should make everything clearer and easy to extend to more sources and other data structures.
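
A rough sketch of what one such class could look like, assuming the JHU time-series CSV layout (the class name, method name, and URL here are illustrative assumptions, not a final API):

```python
# Illustrative sketch of a per-source class: download once at initialization,
# filter on demand. Names, URL, and column handling are assumptions.
import pandas as pd


class JHU:
    URL = (
        "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
        "csse_covid_19_data/csse_covid_19_time_series/"
        "time_series_covid19_confirmed_global.csv"
    )

    def __init__(self):
        # Load the raw table once and keep it as an attribute.
        self.data = pd.read_csv(self.URL)

    def filter_country(self, country, date_begin, date_end):
        """Cumulative cases for one country, indexed by datetime."""
        rows = self.data[self.data["Country/Region"] == country]
        # The JHU table stores dates as columns; sum over provinces and turn
        # the column labels into a datetime index.
        cases = rows.drop(
            columns=["Province/State", "Country/Region", "Lat", "Long"]
        ).sum()
        cases.index = pd.to_datetime(cases.index)
        return cases.loc[date_begin:date_end].to_frame(name=country)
```

Usage would then be something like `jhu = JHU()` followed by `jhu.filter_country("Germany", date_begin, date_end)`.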

Would love some feedback on this proposal.

Sounds good! I could start working on it :)

Yep, one class per data source seems like a good idea.

A few comments:

  • I agree that classes help unclutter the namespace, especially as the number of sources grows and we need multiple filter_ functions to get what we want.
  • data should not be loaded at initialization though: the RKI data is already at 13 MB (up to 4k new entries per day), and possible future data sources could be much larger (see the lazy-loading sketch below).
  • we should keep in mind that the module needs to stay simple: by the nature of scraping, things are prone to break often, so the code needs to be easy to fix.
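
One way to keep the class-based design without downloading at initialization would be to fetch the data lazily on first access, roughly like this (the class name, attribute names, and URL are placeholders):

```python
# Sketch of lazy loading: nothing is downloaded until the data is first
# accessed. The class name and URL are placeholders, not the real source.
import pandas as pd


class RKI:
    URL = "https://example.org/rki_cases.csv"  # placeholder URL

    def __init__(self):
        self._data = None  # nothing downloaded yet

    @property
    def data(self):
        # Download on first access and cache the result for later calls.
        if self._data is None:
            self._data = pd.read_csv(self.URL)
        return self._data
```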

I suppose that you meant loading it at the initialization of the object, not of the module. I see no problem with that. One could add an extra source argument to __init__ for the case where the data is already downloaded locally, e.g. as sketched below.
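
A short sketch of that idea, extending the class above (the parameter name filepath is hypothetical):

```python
# Sketch of an optional local-source argument in __init__; filepath is a
# hypothetical parameter name. Without it, the data is downloaded as before.
import pandas as pd


class JHU:
    URL = (
        "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
        "csse_covid_19_data/csse_covid_19_time_series/"
        "time_series_covid19_confirmed_global.csv"
    )

    def __init__(self, filepath=None):
        # Read a local copy if one is given, otherwise download from the URL.
        source = filepath if filepath is not None else self.URL
        self.data = pd.read_csv(source)
```

So `JHU()` would download the data, while `JHU("local_copy.csv")` would read the local file.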

I now had some time to look at what you have programmed. A few things that I think would be good to change:

  • that every filter function accepts datetime.datetime objects as begin and end dates, and not strings
  • that the output of the JHU data has datetime.datetime objects as its index. I think that's what the current to-ISO conversion function is doing.
  • that the filter functions always return the new daily cases, and not the cumulative ones. That change would involve calculating the difference in the JHU dataset. The data should then exclude the date date_begin and include the date date_end (so that its length is (date_end - date_begin).days); see the sketch after this list.
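
A sketch of the last two points together, i.e. diffing the cumulative counts and applying the proposed date semantics (the function name filter_new_cases and the numbers are hypothetical):

```python
# Sketch: turn cumulative counts into new daily cases, excluding date_begin
# and including date_end, so the result has (date_end - date_begin).days rows.
# filter_new_cases and the synthetic data are hypothetical.
import datetime

import pandas as pd


def filter_new_cases(cumulative, date_begin, date_end):
    """cumulative: DataFrame of cumulative counts with a datetime index."""
    new_cases = cumulative.diff().dropna()  # daily increments
    mask = (new_cases.index > date_begin) & (new_cases.index <= date_end)
    return new_cases[mask]


index = pd.date_range("2020-03-01", periods=7, freq="D")
cumulative = pd.DataFrame(
    {"Germany": [100, 130, 170, 220, 280, 350, 430]}, index=index
)
date_begin = datetime.datetime(2020, 3, 2)
date_end = datetime.datetime(2020, 3, 6)

result = filter_new_cases(cumulative, date_begin, date_end)
assert len(result) == (date_end - date_begin).days  # four rows: Mar 3 to Mar 6
```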