SUSTAINABILITY: Data storage for the Canadian COVID-19 Data Archive

Question

Opened this issue 3 years ago · 0 comments

The current storage infrastructure is an S3 bucket (https://s3.us-east-2.amazonaws.com/data.opencovid.ca/ / http://data.opencovid.ca), which is being charged to a personal credit card. S3 does not deduplicate files, even though a very large number of the archived files are duplicates (because the source files are not updated every day). The cost of maintaining the archive therefore increases as data are added and the number of requests rises. This is not sustainable.

In terms of a new backend storage solution, we have the following requirements:

It must handle both text (e.g., CSV, JSON, HTML) and binary files (e.g., XLSX, PDF).
It should handle file de-duplication (e.g., identical snapshots of the same file over multiple days should point to a single underlying copy of the data).
It must integrate well with the proposed back-end API and front-end data tool.
It would be nice to be able to easily diff text datasets.
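
To make the de-duplication and diff requirements more concrete, here is a minimal sketch of a content-addressed store, assuming a plain local directory as the backend. It is only an illustration of the idea, not a proposal for the actual implementation: the `ContentStore` class, its method names, and the in-memory index are all hypothetical, and a real backend would need a persistent index and would have to fit the proposed API. Each snapshot is stored under the SHA-256 digest of its bytes, so identical snapshots from different days point to one underlying copy, and text and binary files are handled the same way.

```python
import difflib
import hashlib
from pathlib import Path


class ContentStore:
    """Toy content-addressed store: one blob per unique SHA-256 digest."""

    def __init__(self, root: Path) -> None:
        self.blobs = root / "blobs"
        self.blobs.mkdir(parents=True, exist_ok=True)
        # Hypothetical in-memory index: (file name, snapshot date) -> digest.
        self.index: dict[tuple[str, str], str] = {}

    def put(self, name: str, date: str, data: bytes) -> str:
        """Record a snapshot; write the bytes only if they are new."""
        digest = hashlib.sha256(data).hexdigest()
        blob = self.blobs / digest
        if not blob.exists():  # identical snapshots share one blob
            blob.write_bytes(data)
        self.index[(name, date)] = digest
        return digest

    def get(self, name: str, date: str) -> bytes:
        """Return the bytes recorded for a file on a given date."""
        return (self.blobs / self.index[(name, date)]).read_bytes()

    def diff(self, name: str, old: str, new: str) -> str:
        """Unified diff between two snapshots of a UTF-8 text file."""
        a = self.get(name, old).decode("utf-8").splitlines(keepends=True)
        b = self.get(name, new).decode("utf-8").splitlines(keepends=True)
        return "".join(
            difflib.unified_diff(
                a, b, fromfile=f"{name}@{old}", tofile=f"{name}@{new}"
            )
        )
```

With this layout, calling `put` for the same unchanged file on three consecutive days would write one blob and three index entries, and `diff` would return an empty string for any pair of identical snapshots. The diff helper assumes UTF-8 text and is not meaningful for binary formats such as XLSX or PDF.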