datopian/ckanext-blob-storage

Migration script from ckanext-cloudstorage to this storage

A script to migrate resources can be written based on the CKAN API (+ Python SDK, as it will need access to Git LFS as well as CKAN), or by directly accessing the DB and Azure Blob Storage. The former is most likely preferable and easier, though the latter may be faster.
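
A minimal sketch of the API-based approach, assuming datasets are enumerated through CKAN's `package_search` action, which pages with `rows` and `start` and returns a dict containing `results`. The `package_search` argument is any callable with that shape (for example the one exposed by the `ckanapi` client); `iter_resources` is a hypothetical helper name and is not taken from the merged script.

```python
from typing import Callable, Dict, Iterator


def iter_resources(package_search: Callable[..., Dict], page_size: int = 100) -> Iterator[Dict]:
    """Yield every resource dict of every dataset, fetching one page at a time."""
    start = 0
    while True:
        # Assumed response shape: {"count": <int>, "results": [<dataset dict>, ...]}
        page = package_search(rows=page_size, start=start)
        datasets = page.get("results", [])
        if not datasets:
            break  # an empty page means every dataset has been seen
        for dataset in datasets:
            for resource in dataset.get("resources", []):
                yield resource
        start += len(datasets)
```

Passing the search function in as an argument keeps the loop free of any network or authentication details, so it can be exercised against an in-memory fake before it is pointed at a real CKAN instance.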

Acceptance

  • Script to migrate data from one storage bucket (on Azure) to another
  • Script to update metadata in CKAN MetaStore
  • Test for this script (?)

Tasks

  • Get all resources that do not have lfs_prefix and sha256 attributes set
  • For each resource found, download the file
  • Calculate the sha256 and upload the resource to blob storage via the Git LFS server. NOTE: we probably want to use Azure copy commands rather than download and upload, as the data is large. We would still need to download each file to calculate the sha256.
  • Update the resource lfs_prefix and sha256 attributes
  • Iterate until no more resources are found
  • Test out ...
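
The first and third tasks could be sketched as below. This assumes a resource counts as unmigrated when either `lfs_prefix` or `sha256` is missing or empty, and that the downloaded file is available as a binary stream. Both function names are hypothetical. The size is returned alongside the digest because Git LFS identifies an object by its SHA-256 and its size.

```python
import hashlib
from typing import BinaryIO, Dict, Tuple


def needs_migration(resource: Dict) -> bool:
    """True when the resource lacks either attribute (assumed selection rule)."""
    return not (resource.get("lfs_prefix") and resource.get("sha256"))


def sha256_and_size(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> Tuple[str, int]:
    """Hash a stream in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        digest.update(chunk)
        size += len(chunk)
    return digest.hexdigest(), size
```

Reading in chunks matters here because the issue notes that the data is large; the digest is updated incrementally, so memory use stays bounded by `chunk_size`.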

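This can easily be parallelized if we need it to finish quickly (e.g. while the system is taken down), or it can run slowly in the background. It can also be restarted if needed and will continue from where it stopped, because resources that have already been migrated carry the lfs_prefix and sha256 attributes and are skipped.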
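
A rough sketch of how the loop could be parallelized while staying restartable, assuming two hypothetical callables: `list_pending()` returns the next batch of unmigrated resources, and `migrate_one(resource)` downloads, uploads and updates a single resource. Neither name comes from the merged script.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_migration(
    list_pending: Callable[[], List[Dict]],
    migrate_one: Callable[[Dict], None],
    workers: int = 4,
    max_passes: int = 1000,
) -> int:
    """Migrate batches in parallel until none are pending; return how many were processed."""
    migrated = 0
    for _ in range(max_passes):  # upper bound guards against looping forever
        pending = list_pending()
        if not pending:
            break  # nothing left to migrate
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() waits for the whole batch and re-raises a worker's exception, if any
            migrated += len(list(pool.map(migrate_one, pending)))
    return migrated
```

No separate checkpoint file is needed in this design: progress is recorded on the resources themselves, so a second run simply finds fewer pending resources. Setting `workers=1` gives the slow background mode.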

The script has been merged in #40. Note that we download and upload rather than use any Azure APIs, because we don't want anything Azure-specific. We use CKAN's fallback download mechanism to download files, so whatever storage mechanism is behind it is abstracted away. The same goes for upload: ckanext-blob-storage has no knowledge of, or dependency on, Azure.