SUSTAINABILITY: Data storage for the Canadian COVID-19 Data Archive
jeanpaulrsoucy commented
The current storage infrastructure is an S3 bucket (https://s3.us-east-2.amazonaws.com/data.opencovid.ca/ / http://data.opencovid.ca), which is billed to a personal credit card. S3 does not deduplicate files, even though a very large number of the archived files are duplicates (because the source files are not updated every day). The cost of maintaining the archive grows as data are added and the number of requests rises. This is not sustainable.
In terms of a new backend storage solution, we have the following requirements:
- It must handle both text (e.g., CSV, JSON, HTML) and binary files (e.g., XLSX, PDF).
- It should handle file de-duplication (e.g., identical snapshots of the same file over multiple days should point to a single underlying copy of the data).
- It must integrate well with the proposed back-end API and front-end data tool.
- Would be nice to have the ability to easily `diff` text datasets.
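
To make the de-duplication and diff requirements concrete, here is a minimal sketch of one possible approach: a content-addressed store, where each unique file is saved once under its SHA-256 hash and every daily snapshot is just an index entry pointing at that hash. This is only an illustration, not a proposed implementation. The `DedupStore` class, its method names, and the local-directory layout are all hypothetical, and a real backend would need a persistent index and would likely sit on top of object storage rather than a local folder.

```python
import difflib
import hashlib
import tempfile
from pathlib import Path


class DedupStore:
    """Hypothetical content-addressed store: one blob per unique file content."""

    def __init__(self, root):
        self.blobs = Path(root) / "blobs"
        self.blobs.mkdir(parents=True, exist_ok=True)
        # Maps (snapshot date, file name) -> SHA-256 hex digest of the content.
        # Kept in memory here; a real system would persist this index.
        self.index = {}

    def put(self, date, name, data: bytes) -> str:
        """Record a snapshot; the bytes are written only if they are new."""
        digest = hashlib.sha256(data).hexdigest()
        blob = self.blobs / digest
        if not blob.exists():
            blob.write_bytes(data)
        self.index[(date, name)] = digest
        return digest

    def get(self, date, name) -> bytes:
        """Return the bytes of a file as it was on a given date."""
        return (self.blobs / self.index[(date, name)]).read_bytes()

    def blob_count(self) -> int:
        """Number of unique blobs actually stored."""
        return sum(1 for _ in self.blobs.iterdir())

    def diff(self, name, old_date, new_date):
        """Unified diff between two snapshots of a UTF-8 text file."""
        old = self.get(old_date, name).decode("utf-8").splitlines()
        new = self.get(new_date, name).decode("utf-8").splitlines()
        return list(
            difflib.unified_diff(
                old,
                new,
                fromfile=f"{name}@{old_date}",
                tofile=f"{name}@{new_date}",
                lineterm="",
            )
        )


if __name__ == "__main__":
    store = DedupStore(tempfile.mkdtemp())
    unchanged = b"date,cases\n2021-01-01,10\n"
    updated = b"date,cases\n2021-01-01,10\n2021-01-03,12\n"
    store.put("2021-01-01", "cases.csv", unchanged)
    store.put("2021-01-02", "cases.csv", unchanged)  # duplicate, no new blob
    store.put("2021-01-03", "cases.csv", updated)
    print(store.blob_count(), "unique blobs for 3 snapshots")
    print("\n".join(store.diff("cases.csv", "2021-01-02", "2021-01-03")))
```

Because the store works on raw bytes, the same `put`/`get` path covers both text and binary files (XLSX, PDF); only `diff` assumes UTF-8 text.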