Cache downloaded data in `data_retrieval`

Question

Closed this issue 4 years ago · 8 comments

We should cache the data,
ideally between runs (module loads) or at least for the active session.

Answer 1 · 2020-04-21T08:55:40.000Z

If you are only talking about caching the download, we could check for the 'last-modified' header and only pull a new version if it is newer than a local one.
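A minimal sketch of that check, assuming the header value and the timestamp of the local copy are already at hand. The function name `needs_download` and its parameters are made up for illustration and are not part of `data_retrieval`:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def needs_download(remote_last_modified: Optional[str],
                   local_timestamp: Optional[datetime]) -> bool:
    """Return True if the remote file should be fetched again.

    `local_timestamp` is assumed to be timezone-aware (e.g. UTC).
    """
    # No local copy yet, or the server sent no Last-Modified header:
    # downloading is the safe choice.
    if local_timestamp is None or remote_last_modified is None:
        return True
    # HTTP dates such as 'Fri, 17 Apr 2020 00:18:22 GMT' parse to an
    # aware datetime, so the comparison below is well defined.
    remote_timestamp = parsedate_to_datetime(remote_last_modified)
    return remote_timestamp > local_timestamp
```

The header string itself could come from `urllib.request.urlopen(url).headers.get('Last-Modified')`, which returns `None` when the server does not send it.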

If you are talking about caching the get_* methods, `functools.lru_cache` should work, but I do not think this will gain us much performance.

Answer 2 · 2020-04-21T10:45:12.000Z

I like @semohr's idea; it makes sense for the class to store the modified date.

This would need to be defined per source: of our current ones, only Google returns a Last-Modified in the headers. For JHU we could scrape the commit date, I guess, and for RKI we could use Datenstand.

Google

urllib.request.urlopen('https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv').headers.items()

Out[30]: 
[('Vary', 'Accept-Encoding'),
 ('Accept-Ranges', 'bytes'),
 ('Content-Type', 'text/csv'),
 ('Content-Length', '14090517'),
 ('Date', 'Tue, 21 Apr 2020 10:29:49 GMT'),
 ('Expires', 'Wed, 21 Apr 2021 10:29:49 GMT'),
 ('Cache-Control', 'public, max-age=31536000'),
 ('Last-Modified', 'Fri, 17 Apr 2020 00:18:22 GMT'),
 ('X-Content-Type-Options', 'nosniff'),
 ('X-Robots-Tag', 'noindex'),
 ('Server', 'sffe'),
 ('X-XSS-Protection', '0'),
 ('Alt-Svc',
  'quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,h3-T050=":443"; ma=2592000')]

JHU

urllib.request.urlopen('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv').headers.items()
Out[33]: 
[('Connection', 'close'),
 ('Content-Length', '63631'),
 ('Content-Type', 'text/plain; charset=utf-8'),
 ('Cache-Control', 'max-age=300'),
 ('Content-Security-Policy',
  "default-src 'none'; style-src 'unsafe-inline'; sandbox"),
 ('ETag',
  'W/"eb2b872fe3dffa18fe5668d2145a8335faf912a1aa6d871153f1d52adda44a9f"'),
 ('Strict-Transport-Security', 'max-age=31536000'),
 ('X-Content-Type-Options', 'nosniff'),
 ('X-Frame-Options', 'deny'),
 ('X-XSS-Protection', '1; mode=block'),
 ('Via', '1.1 varnish (Varnish/6.0)'),
 ('X-GitHub-Request-Id', 'F8E4:63E6:311F2E:3DD53F:5E9ECCAA'),
 ('Accept-Ranges', 'bytes'),
 ('Date', 'Tue, 21 Apr 2020 10:42:40 GMT'),
 ('Via', '1.1 varnish'),
 ('X-Served-By', 'cache-hhn4058-HHN'),
 ('X-Cache', 'HIT, HIT'),
 ('X-Cache-Hits', '2, 1'),
 ('X-Timer', 'S1587465761.669682,VS0,VE1'),
 ('Vary', 'Authorization,Accept-Encoding'),
 ('Access-Control-Allow-Origin', '*'),
 ('X-Fastly-Request-ID', 'd96337c661ab3460bff4aa857b3507dc81c3132d'),
 ('Expires', 'Tue, 21 Apr 2020 10:47:40 GMT'),
 ('Source-Age', '81')]
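The JHU response above has no Last-Modified but does carry an ETag, so a conditional request with `If-None-Match` might be another option. This is only a sketch of building such a request; `build_conditional_request` is a hypothetical helper, and nothing here has been tried against the actual server:

```python
import urllib.request
from typing import Optional


def build_conditional_request(url: str,
                              etag: Optional[str]) -> urllib.request.Request:
    """Build a GET request that asks the server to skip unchanged content."""
    request = urllib.request.Request(url)
    if etag is not None:
        # The server may answer 304 Not Modified if the stored ETag
        # still matches the current version of the file.
        request.add_header("If-None-Match", etag)
    return request
```

With `urllib.request.urlopen`, a 304 reply should surface as an `urllib.error.HTTPError` with `code == 304`, which the caller would treat as "keep the local copy".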

Answer 3 · 2020-04-21T10:47:43.000Z

We should think about where to store it, perhaps in a folder data/ which is added to the .gitignore file. In general I think it is a good idea.

Answer 4 · 2020-04-21T12:16:51.000Z

Working on it 👍

Answer 5 · 2020-04-22T20:07:08.000Z

JHU is still missing, right? Looks great though!

Answer 6 · 2020-04-24T08:47:59.000Z

JHU is still missing, since there is also no "last-modified" header for GitHub. But one could get the last commit date for https://github.com/CSSEGISandData/COVID-19. We would need to use pygit for that; is it fine to add that?
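As a possible alternative to a git dependency (a different technique from the pygit idea above), the GitHub REST API lists commits per path. The sketch below only builds the URL and parses a response body; both function names are hypothetical, and unauthenticated API calls are rate limited, so treat it as an assumption to verify:

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode


def commits_api_url(owner: str, repo: str, path: str) -> str:
    """URL that asks for the most recent commit touching `path`."""
    query = urlencode({"path": path, "per_page": 1})
    return f"https://api.github.com/repos/{owner}/{repo}/commits?{query}"


def last_commit_date(response_body: str) -> datetime:
    """Extract the committer date of the newest commit from the JSON reply."""
    commits = json.loads(response_body)
    raw = commits[0]["commit"]["committer"]["date"]  # e.g. '2020-04-24T01:23:45Z'
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)
```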

On the other hand, we wanted to change the RKI date check to use the ArcGIS API.
I tried to filter only for the Meldedatum here and check for the newest one in the list, but that feels kind of hacky too. Is there a way to query only the date on which the dataset was last updated?

Additionally, removing the os.path.getmtime() call seems like a good idea, since its result depends on the operating system and could lead to problems down the road.
I was thinking of creating a dict with the different last-updated dates and saving this dict to a file. (The file has to be added to the .gitignore.)
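Roughly what the dict-in-a-file idea could look like, assuming JSON as the format and one string per source. The function names and the keys used in the usage example are placeholders:

```python
import json
from pathlib import Path
from typing import Dict, Union


def load_timestamps(path: Union[str, Path]) -> Dict[str, str]:
    """Read the stored last-updated dates; empty dict if the file is missing."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def save_timestamp(path: Union[str, Path], source: str, last_updated: str) -> None:
    """Update the entry for one source and write the whole dict back."""
    timestamps = load_timestamps(path)
    timestamps[source] = last_updated
    with Path(path).open("w", encoding="utf-8") as handle:
        json.dump(timestamps, handle, indent=2)
```

For example, `save_timestamp("data/last_updated.json", "google", "Fri, 17 Apr 2020 00:18:22 GMT")` would record the header value from the Google source, provided the data/ folder exists.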

Answer 7 · 2020-04-24T09:46:46.000Z

I think RKI is fine at the moment (I reworked it a bit, I don't know if you saw it @semohr): the current version uses Datenstand as last-modified, which is the only number that is comparable between the different sources.

For JHU, I suggest we just set auto_download = True and fall back to the local copy if it fails: the files are very small (~200 kB total) and grow at a rate of ~600 new numbers per day, so they'll remain small.
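A sketch of the download-then-fall-back behaviour, assuming the actual download is passed in as a callable. `load_with_fallback` and `fetch` are illustrative names, not the existing `data_retrieval` interface:

```python
from pathlib import Path
from typing import Callable


def load_with_fallback(fetch: Callable[[], str], local_path: Path) -> str:
    """Try a fresh download first; use the local copy if the download fails."""
    try:
        text = fetch()
    except OSError:
        # urllib.error.URLError is a subclass of OSError, so network
        # failures from urllib end up here.
        return local_path.read_text(encoding="utf-8")
    # Download worked: refresh the local copy for the next offline run.
    local_path.write_text(text, encoding="utf-8")
    return text
```

If the download fails and no local copy exists yet, `read_text` raises `FileNotFoundError`, which seems like the right signal for the caller.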

Answer 8 · 2020-04-24T09:51:16.000Z

I did not see the RKI changes yet, but they look good.

Sounds like a good suggestion, I will work on that 👍