aiidateam/disk-objectstore

allowing multiple pack storage locations

zhubonan opened this issue · 5 comments

One problem I face with my current AiiDA-based workflow is the growing size of the repository versus the finite size of the fast SSD storage. This can happen quite quickly if I have to run a few "large" calculations for which a lot of data is needed during post-processing and is provenance-critical. In theory, most of the files stored by AiiDA are not frequently accessed, and they are perfectly fine sitting on a slow storage location, e.g. a spinning disk or an NFS mount. On the other hand, having the whole repository on a slow storage location can slow down the daemon and workflows.

I think this package can give a natural solution to this problem. Here, the loose "objects" can be written onto a fast-to-write disk. Read-only access to the "fully" packed packs no longer benefits from fast disk speed, so they can be moved to a slow storage location if needed, e.g.:

  • loose files -> object-store folder on fast SSD
  • not-fully-packed pack files -> object-store folder on fast SSD
  • fully packed pack files with only read access -> additional folders on a slow storage location

At the moment, all of the (integer-numbered) packs are stored under the packs folder. Would it be possible to allow multiple storage locations to be used (for the fully "packed" ones)? I think it should just be a matter of iterating over the storage locations and checking if the file exists; alternatively, a dictionary mapping pack ids to their locations can be built when the Container class is instantiated, to reduce the overhead.
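The dictionary-based approach could be sketched as follows. This is a hypothetical illustration, not the actual disk-objectstore API: the function names and the assumption that pack files are named with bare integer ids are taken from the description above, everything else is made up for the example.

```python
import os


def build_pack_location_map(locations):
    """Scan the candidate storage locations once, at container
    instantiation, and map each pack id to the folder containing it.

    Earlier locations in the list take precedence on name clashes.
    """
    pack_map = {}
    for location in locations:
        if not os.path.isdir(location):
            continue
        for name in os.listdir(location):
            # Pack files are named with integer ids (e.g. '0', '1', ...)
            if name.isdigit() and int(name) not in pack_map:
                pack_map[int(name)] = location
    return pack_map


def get_pack_path(pack_id, pack_map):
    """Return the full path of a pack, wherever it is stored."""
    try:
        return os.path.join(pack_map[pack_id], str(pack_id))
    except KeyError:
        raise FileNotFoundError(f"Pack {pack_id} not found in any location")
```

Building the map once avoids paying a stat() call per storage location on every read, at the cost of having to refresh the map when packs are moved.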

Please let me know what you think about this idea. Thanks!

Proof of concept PR #126

Pinging @giovannipizzi @chrisjsewell

After discussion with @zhubonan and @chrisjsewell, the following design could be envisaged:

  • add a new subfolder inside the container, called archived-packs
  • add a table in the SQLite DB, say ArchivedPacks, with just two columns, pack_id and location (there should be a unique constraint on the pack_id column): the existence of a pack id in this table means the pack should not be looked for in the packs subfolder, but in the location subfolder, which by default is archived-packs
  • we should provide an API (and possibly also dostore command-line commands) to move a pack to the archived directory (possibly with a custom name, checking that this does not overlap with known names like sandbox and loose; or, each folder should be inside archived-packs/<LOCATION>, where <LOCATION> is the value of the location column). This would take care of moving the pack in a way that is aware that the destination might be on a different file system:
    • first check that the pack is sealed (see issue #124; we should define the concept of a "sealed" pack, only move sealed packs, and disallow adding to a pack after it is sealed);
    • then copy it over;
    • then (after checking the MD5 to ensure the pack was successfully copied?) add the entry in the ArchivedPacks table;
    • then (maybe as a maintenance operation) remove the pack from packs and keep only the archived version.
  • there should be some command-line way to find out where a pack is, and/or which archived locations exist
  • in the reading part, when an object is in a pack, if the pack is also in the ArchivedPacks table, then it is loaded from there and not from the packs/ folder.
    • one note: the function that picks the pack to write to should also avoid recreating a pack named 0 if that pack exists and is archived.
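The lookup part of this design could be sketched with plain sqlite3 for illustration (disk-objectstore actually manages its DB via SQLAlchemy; the table and column names follow the proposal above, but the code itself is hypothetical):

```python
import os
import sqlite3

# Proposed ArchivedPacks table: pack_id with a unique constraint,
# plus the name of the subfolder of archived-packs holding the pack.
SCHEMA = """
CREATE TABLE IF NOT EXISTS archived_packs (
    pack_id INTEGER NOT NULL UNIQUE,
    location TEXT NOT NULL
)
"""


def resolve_pack_folder(conn, container_root, pack_id):
    """Return the folder holding pack_id: 'archived-packs/<LOCATION>'
    if the pack is archived, otherwise the default 'packs' folder."""
    row = conn.execute(
        "SELECT location FROM archived_packs WHERE pack_id = ?", (pack_id,)
    ).fetchone()
    if row is not None:
        return os.path.join(container_root, "archived-packs", row[0])
    return os.path.join(container_root, "packs")
```

Because the table is consulted on every read of a packed object, the unique index on pack_id keeps the extra lookup cheap.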

As a power user, I can then create folders inside archived-packs and mount them from some remote location.
In this way, archiving will allow moving big data to other locations.

In addition, there should be a function to check that all packs are actually there (e.g. to catch the case where one of the archived folders is not mounted; ideally, also add the checksum for further validation?). The simple check of file existence should hopefully be fast, and should be done every time a new container instance is created, otherwise an exception is thrown?
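A hedged sketch of such a check, as it might run when a container is opened. The function name and the optional checksum argument are invented for illustration; only the fast existence pass would run at instantiation, with MD5 verification as a slower, opt-in second pass:

```python
import hashlib
import os


def check_packs_exist(pack_paths, checksums=None):
    """Raise if any pack file is missing (e.g. an archived folder is
    not mounted); optionally verify MD5 checksums for extra validation.

    `checksums` is an optional dict mapping pack path -> expected MD5
    hex digest.
    """
    missing = [path for path in pack_paths if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError(f"Missing pack files: {missing}")
    if checksums:
        for path, expected in checksums.items():
            md5 = hashlib.md5()
            # Read in chunks so large packs do not fill memory.
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    md5.update(chunk)
            if md5.hexdigest() != expected:
                raise ValueError(f"Checksum mismatch for {path}")
```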

Finally, it should be easy for the user to archive the packs. E.g. one could have a command dostore archive-packs --keep-last=2 [--location=nfs], where --location might be optional and we might have a default location like archive; the command would take all packs not yet archived, keep the last 2 in the packs/ folder, and "move" all the rest to the archived-packs folder as described above.
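The selection logic behind such a --keep-last option could look like the following. This is a sketch under the assumption that packs are identified by integer ids and that "last" means the highest-numbered ones; the function name is invented:

```python
def select_packs_to_archive(pack_ids, keep_last=2):
    """Given the integer ids of the packs not yet archived, return the
    ids that should be moved to archived storage, keeping the
    `keep_last` highest-numbered (i.e. most recent) packs in packs/.
    """
    ordered = sorted(pack_ids)
    if keep_last <= 0:
        # Archive everything if the user asks to keep nothing.
        return ordered
    return ordered[:-keep_last]
```

With --keep-last=2 and packs 0..3 present, packs 0 and 1 would be moved while 2 and 3 stay writable in packs/.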

@giovannipizzi Thanks for the summary!

One potential issue I can think of with this is that if the user has multiple profiles, and hence multiple repositories, one can potentially make mistakes when mounting the correct folder inside archived-packs for the right disk-objectstore container. If such a mistake is made, my impression is that the current implementation would return an incorrect stream?

At the moment the packs are stored as numbered files, e.g. 1, 2, 3. Would it make sense to add some kind of identifier to the pack file names, such as 1_<uuid-of-container>, to avoid potential errors?
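The suggested naming scheme could be sketched as below, showing how a pack mounted from the wrong container would be detected at open time. Both helper names are hypothetical:

```python
def pack_filename(pack_id, container_uuid):
    """Build a pack file name of the form '<pack_id>_<uuid-of-container>'."""
    return f"{pack_id}_{container_uuid}"


def is_pack_of_container(name, container_uuid):
    """Check that a pack file name carries this container's UUID, so a
    folder mounted from another profile's repository is rejected."""
    pack_id, sep, uuid_part = name.partition("_")
    return bool(sep) and pack_id.isdigit() and uuid_part == container_uuid
```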

Good point, thanks! Either that, or have a JSON file in the folder that gives this information. But I agree.

After re-discussing with @zhubonan, we realized that the logic described here is probably too complex. Probably the easiest solution is to mount just the packs subfolder in a different location. This should typically be sufficient for most use cases. I will therefore close this as a wontfix.