ahupowerdns/covid-backend

Cache-able backend consumption

Opened this issue · 4 comments

sdktr commented

It would be SUPER ACE if we could put the data files on a CDN and not have to worry about them anymore. The initial data file would be quite large, and you'd have to apply delta-sets to them.

Absolutely agree that for this to scale we should provide cache-able files so that CDNs can step in: each file covering a fixed time range, plus aggregated files for a set of deltas. For example, the files available at 2020-04-12 11:30 (a client-side sketch follows this listing):
hourly/20200412-1100.ext <-- contains updates between 11:00 and 11:30
hourly/20200412-1000.ext
hourly/20200412-0900.ext
hourly/20200412-0800.ext
all the way to
hourly/20200412-0000.ext
daily/20200412.ext <-- contains updates until 11:30
daily/20200411.ext
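
To make that concrete, here is a minimal Python sketch of the client side under this layout. The paths, the `.ext` extension and the function name are placeholders taken from the example above, not anything the backend actually ships:

```python
from datetime import datetime, timedelta, timezone

def files_to_fetch(last_sync: datetime, now: datetime):
    """Which delta files a client would request under the hourly/daily
    layout sketched above (paths and .ext are placeholders)."""
    files = []
    # Aggregated daily files for every day touched since the last sync
    # (the last-sync day is re-fetched whole; a little redundancy for simplicity).
    day = last_sync.date()
    while day < now.date():
        files.append(f"daily/{day:%Y%m%d}.ext")
        day += timedelta(days=1)
    # Hourly delta files published so far for the current, incomplete day.
    hour = datetime.combine(now.date(), datetime.min.time(), tzinfo=timezone.utc)
    while hour <= now:
        files.append(f"hourly/{hour:%Y%m%d-%H}00.ext")
        hour += timedelta(hours=1)
    return files

print(files_to_fetch(
    datetime(2020, 4, 10, 9, 0, tzinfo=timezone.utc),
    datetime(2020, 4, 12, 11, 30, tzinfo=timezone.utc),
))
# ['daily/20200410.ext', 'daily/20200411.ext',
#  'hourly/20200412-0000.ext', ..., 'hourly/20200412-1100.ext']
```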

We can also stick to the day-ranging scheme as proposed by Apple and Google in their draft proposal, which is the number of seconds since the Unix Epoch divided by the number of seconds in a day (24 * 60 * 60), floored. This can easily be extended to an hour-ranging scheme, of course.
In that way you avoid all the issues related to time zones, etc.
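
For reference, a minimal Python sketch of that calculation; the constant and function names are just for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60    # 86400
SECONDS_PER_HOUR = 60 * 60        # 3600

def day_number(unix_seconds: int) -> int:
    # Day-ranging scheme from the draft: seconds since the Unix Epoch
    # divided by the seconds in a day, floored. No time zones involved.
    return unix_seconds // SECONDS_PER_DAY

def hour_number(unix_seconds: int) -> int:
    # The same idea extended to hour granularity.
    return unix_seconds // SECONDS_PER_HOUR

# 2020-04-12 11:30:00 UTC
print(day_number(1586691000), hour_number(1586691000))  # 18364 440747
```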

sdktr commented

We can also stick to the day-ranging scheme as proposed by Apple and Google in their draft proposal, which is the number of seconds since the Unix Epoch divided by the number of seconds in a day (24 * 60 * 60), floored. This can easily be extended to an hour-ranging scheme, of course.
In that way you avoid all the issues related to time zones, etc.

This means a daily file that would have to be fetched multiple times a day if you want intermediate updates? That's quite a lot of redundant data, even assuming something like an 'ETag' is used to know whether additions have been made since the last request?

It's also possible to calculate the number of hours since the Unix Epoch, of course. That way you can publish one file per hour.

I also suggest not appending to an existing file, but publishing the file only after its time period is over, so the file is complete and won't receive any further updates. When using a CDN it's very hard to push updates to files.
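
A rough sketch of that write-once publishing step in Python, assuming it runs shortly after the top of every hour (for instance from a crontab); `keys_for_hour` and the file naming are hypothetical:

```python
import os
import time

SECONDS_PER_HOUR = 60 * 60

def publish_previous_hour(keys_for_hour, outdir="hourly"):
    """Publish the file for the last *completed* hour window exactly once,
    and never touch it afterwards, so CDNs can cache it indefinitely.
    keys_for_hour is a hypothetical callable returning the finished bytes
    for a given hour number."""
    hour = int(time.time()) // SECONDS_PER_HOUR - 1   # last completed hour
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, f"{hour}.ext")        # .ext is a placeholder
    if os.path.exists(path):
        return path                                   # write-once: never republish
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(keys_for_hour(hour))
    os.rename(tmp, path)   # atomic rename, so a partially written file is never served
    return path
```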

We can also stick to the day-ranging scheme as proposed by Apple and Google in their draft proposal, which is the number of seconds since the Unix Epoch divided by the number of seconds in a day (24 * 60 * 60), floored. This can easily be extended to an hour-ranging scheme, of course.
In that way you avoid all the issues related to time zones, etc.

I think it would be wise to stick to the day-ranging scheme, as that is the simplest and in line with the standard.

But I do want to push out diagnosis keys more often.

Since Apple and Google state that rolling identifiers should last at least 10 minutes and at most 20 minutes, and given how easy it is to set up a crontab, once per hour seems like a reasonable trade-off. Note that I have no strong technical argument for setting it at one hour.

This means a daily file that would have to be fetched multiple times a day if you want intermediate updates? That's quite a lot of redundant data, even assuming something like an 'ETag' is used to know whether additions have been made since the last request?

I suggested using a signed sha256sums file as an index file in a different issue.
These files will have about hours * days * hours lines in them, so about 24 * 14 * 24 = 8064. With sha256 sums this will be roughly 800 KB, which gzip will flatten to about 400 to 500 KB.
In the grand scheme of things, this is not much if we compare it to the amount of data we have to send out for the key files.
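
As a sketch, generating such an index in Python could look like the following; the glob pattern and file names are placeholders, and the signing step (e.g. a detached signature over the index) is left out:

```python
import glob
import hashlib

def write_index(pattern="hourly/*.ext", index_path="SHA256SUMS"):
    """Write a sha256sums-style index of all published key files.
    One line per file: '<hex digest>  <path>'."""
    lines = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        lines.append(f"{digest}  {path}")
    with open(index_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```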

Furthermore, I'd like to point out that the HTTP protocol has HEAD requests, and that we could also publish a small text file containing the timestamp of the last update run. Its only content has to be "2020-04-13T16:50:00+00:00", which is far smaller than the HTTP request a user has to send out to check for an update.
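
A small sketch of both client-side checks in Python; the base URL, the last-update.txt file name and the use of the requests library are assumptions, not part of the backend:

```python
import requests  # any HTTP client would do

BASE = "https://cdn.example/keys/"   # placeholder URL

def updated_since(last_seen: str) -> bool:
    # Fetch the tiny timestamp file and compare; ISO 8601 strings with the
    # same UTC offset compare correctly as plain text.
    stamp = requests.get(BASE + "last-update.txt").text.strip()
    return stamp > last_seen

def changed(path: str, etag: str):
    # Alternative: a conditional HEAD request; a 304 response means no change
    # (provided the server honours If-None-Match on HEAD).
    r = requests.head(BASE + path, headers={"If-None-Match": etag})
    return r.status_code != 304, r.headers.get("ETag", etag)
```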

I also suggest not appending to an existing file, but publishing the file only after its time period is over, so the file is complete and won't receive any further updates. When using a CDN it's very hard to push updates to files.

I fully agree with this, because "once it's out the door, it's out the door and you can't get it back".
What we push out should be final, for our own sake, but also for the sake of the networks of the ISPs, which will probably be caching these files as well.