Code separation for storage location and storage format
tischi opened this issue · 9 comments
@joshmoore and I were wondering whether it would be possible (and make sense) to separate the code by storage location and storage format.
So for example, one could have something like
new N5Reader( StorageLocation storageLocation, StorageFormat storageFormat )
with, e.g.,
enum StorageLocation {
FileSystem,
AWSS3,
GoogleCloud
}
enum StorageFormat {
N5,
Zarr
}
This would of course mainly make sense if the N5Reader class could mix and match the two, thereby avoiding code duplication.
Not sure I am making sense?
Maybe @joshmoore can formulate this more professionally?
@axtimwalde and I talked about this recently.
This would involve big changes, and so won't happen soon, but I agree it's desirable and should be on the roadmap.
@bogovicj : do you have any idea of a workaround? Our hope was to be able to access IDR data in OME-Zarr on S3 via the N5 stack for I2K? (Wow. The acronyms abound!) My best guess would be to make a copy/fork of n5-aws-s3, add a dependency on n5-zarr, and migrate the N5ZarrReader use of DType etc. into https://github.com/saalfeldlab/n5-aws-s3/blob/master/src/main/java/org/janelia/saalfeldlab/n5/s3/N5AmazonS3Reader.java (or a subclass).
Wow. The acronyms abound!
😂
make a copy/fork of n5-aws-s3, add a dependency on n5-zarr,
Either that or the reverse - fork n5-zarr and depend on n5-aws-s3. Not sure which would be less work in the end...
Were I to try to hack something together quickly, I think I might start with N5ZarrReader, rip out any filesystem calls, and try to replace them with the appropriate stuff in N5AmazonS3Reader.
Gotcha. The only other suggestion that has also been discussed on the Zarr side (zarr-developers/zarr-python#540) would be a wrapper strategy roughly of the form:
n5 = new N5ZarrWrapper(new N5AmazonS3Reader())
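To make the wrapper idea concrete, here is a minimal decorator sketch. Everything in it is hypothetical: `SimpleReader`, `FakeBucketReader`, and `ZarrPathWrapper` are illustrative names, not part of n5, n5-zarr, or n5-aws-s3, and mapping `attributes.json` straight onto `.zarray` is a deliberate simplification of what a real Zarr dialect would need.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal reader interface (NOT the real N5Reader API).
interface SimpleReader {
    String readText(String path); // returns null when the path is absent
}

// In-memory stand-in for a location-specific reader, e.g. one backed by S3.
class FakeBucketReader implements SimpleReader {
    private final Map<String, String> objects = new HashMap<>();

    FakeBucketReader put(String path, String text) {
        objects.put(path, text);
        return this;
    }

    @Override
    public String readText(String path) {
        return objects.get(path);
    }
}

// Decorator: callers ask for N5-style paths; the wrapper rewrites them
// to Zarr-style paths before delegating to the wrapped reader.
class ZarrPathWrapper implements SimpleReader {
    private static final String N5_SUFFIX = "/attributes.json";
    private static final String ZARR_SUFFIX = "/.zarray";

    private final SimpleReader delegate;

    ZarrPathWrapper(SimpleReader delegate) {
        this.delegate = delegate;
    }

    @Override
    public String readText(String path) {
        if (path.endsWith(N5_SUFFIX)) {
            String dataset = path.substring(0, path.length() - N5_SUFFIX.length());
            path = dataset + ZARR_SUFFIX;
        }
        return delegate.readText(path);
    }
}
```

The wrapper only works if the wrapped reader exposes format-agnostic access to raw objects, which is an assumption of this sketch rather than something the existing readers are known to provide.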
I think the most mature path forward is to introduce a separate interface for KeyValueStores and implement the N5Reader and N5Writer interfaces for various dialects (N5, Zarr, BrainMaps, Boss, ...) on top of this. This could capture file-system and cloud storage but would not be appropriate for HDF5, which is fine but noteworthy. Also, while this sounds trivial at first glance, it becomes a bit icky when considering the differences in how the various backends lock (or don't lock), i.e. the currently straightforward file-system logic will become clunkier. This is mainly why I haven't touched it yet. Skipping file-system and doing it only for cloud-storage interfaces defeats the purpose...
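A rough sketch of that layering, assuming one interface for the storage location and one for the format's key layout. All type names here (`KeyValueStore`, `InMemoryStore`, `Dialect`, `N5Layout`, `ZarrLayout`, `SketchReader`) are made up for illustration and are not the N5 API; locking, the tricky part mentioned above, is left out entirely.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical storage-location abstraction: it only maps keys to bytes.
interface KeyValueStore {
    byte[] read(String key); // returns null when the key is absent
    void write(String key, byte[] value);
}

// In-memory stand-in for a file system or a cloud bucket.
class InMemoryStore implements KeyValueStore {
    private final Map<String, byte[]> data = new HashMap<>();

    @Override
    public byte[] read(String key) {
        return data.get(key);
    }

    @Override
    public void write(String key, byte[] value) {
        data.put(key, value);
    }
}

// Hypothetical storage-format abstraction: it only knows the key layout.
interface Dialect {
    String metadataKey(String dataset);
}

// N5 keeps dataset attributes in attributes.json.
class N5Layout implements Dialect {
    @Override
    public String metadataKey(String dataset) {
        return dataset + "/attributes.json";
    }
}

// Zarr (v2) keeps array metadata in .zarray.
class ZarrLayout implements Dialect {
    @Override
    public String metadataKey(String dataset) {
        return dataset + "/.zarray";
    }
}

// A reader composed from one location and one format.
class SketchReader {
    private final KeyValueStore store;
    private final Dialect dialect;

    SketchReader(KeyValueStore store, Dialect dialect) {
        this.store = store;
        this.dialect = dialect;
    }

    // Returns the raw metadata text, or null if the dataset has none.
    String readMetadata(String dataset) {
        byte[] raw = store.read(dialect.metadataKey(dataset));
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }
}

class KeyValueSketchDemo {
    public static void main(String[] args) {
        KeyValueStore store = new InMemoryStore();
        store.write("vol/.zarray", "{\"zarr_format\":2}".getBytes(StandardCharsets.UTF_8));
        System.out.println(new SketchReader(store, new ZarrLayout()).readMetadata("vol"));
        System.out.println(new SketchReader(store, new N5Layout()).readMetadata("vol"));
    }
}
```

With this split, adding a new backend means writing one `KeyValueStore` and adding a new format means writing one `Dialect`, instead of one reader class per combination.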
@axtimwalde : hmmm.... could you sketch out a bit of the inheritance hierarchy that you'd see? Or is the design of that hierarchy part of the problem and therefore needs more time?
Is there any short-term path forward that you wouldn't consider monstrous?
As a short-term hack, I would plug the AWS S3 access logic into a copy of n5-zarr as @bogovicj suggested, and add alternative readers and writers to the n5-zarr repo. It's not too much work and will be consistent with people using N5 through the N5Reader and N5Writer interfaces, which they should. This would also cover all practical use cases that we currently have.
I've opened saalfeldlab/n5-zarr#5 as a draft. It's possible that this should never be merged and instead should exist in a separate branch and/or repository.
Closing this issue now that AmazonS3KeyValueAccess exists for the purpose of achieving this separation.