JVM Zarr implementation?
ryan-williams opened this issue · 27 comments
There isn't one, is there?
I've started making one, will post updates here.
In #285 there was mention of n5, which has Java and Rust implementations, maybe more. n5 is similar in concept to zarr, apparently.
N5 is basically this. The specs differ in minor ways. Convergence would be good to have. There is some relevant discussion in https://github.com/zarr-developers/zarr/issues/231.
JVM implementation of Zarr would be very cool, particularly if it had the same flexibility as the Python implementation to plug in different storage back-ends including cloud object stores.
Thanks for all the pointers! I've looked a bit at n5 and z5; a couple questions:
- what are the tradeoffs of wrapping z5's C++ implementation for JVM use?
- we would call z5 via JNI, right?
- JNI seems to have a reputation for being hard/brittle; is that warranted? I'm not experienced with it.
- can z5 read/write directly to cloud stores?
- doing this in python (via gcsfs/s3fs) and Java (via NIO adapters) seems to work well (see the NIO sketch after this list) - otoh I've been stymied by python-wrapped C libraries that don't seem to allow this, e.g. h5py
- aside: I'm not aware of a way to read HDF5 from cloud stores in python
- @tomwhite and I forked the netCDF library to support HDF5 IO in cloud stores via NIO, and I know that a JVM-native zarr impl would similarly handle this well
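To make the "NIO adapters" point concrete, here is a minimal, untested sketch of reading an object from GCS through plain java.nio.file, assuming a provider for the gs:// scheme (e.g. google-cloud-nio) is on the classpath; the bucket and object names are made up:

```java
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CloudNioRead {
    public static void main(String[] args) throws Exception {
        // Resolves only if an NIO FileSystemProvider for "gs" is on the classpath
        // (e.g. google-cloud-nio); the bucket/object names here are hypothetical.
        Path zarray = Paths.get(URI.create("gs://my-bucket/data.zarr/.zarray"));

        // From here on it's ordinary java.nio.file code; the provider does the I/O.
        byte[] bytes = Files.readAllBytes(zarray);
        System.out.println(new String(bytes));
    }
}
```

The same calling code works against a local path or an S3 provider, which is what makes an NIO-based Zarr/HDF5 stack attractive on the JVM.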
I'm not aware of a way to read HDF5 from cloud stores in python
gcsfs's FUSE module does allow this, and there are other FUSE solutions out there too. The implementation is not at all performant compared to zarr. In addition, https://github.com/ContinuumIO/intake-xarray will shortly allow streaming of any xarray dataset, including hdf, from a server; again, there are other solutions that do something similar.
n5, which has Java and Rust implementations, maybe more
z5 acts as a C++ and Python implementation for both zarr and N5
can z5 read/write directly to cloud stores
No, it's purely targeted at the file system format for both zarr and n5 as far as I know.
@ryan-williams there is already a bit of an ecosystem (albeit one tightly constrained to one institute...) rapidly evolving around the Java N5 implementation, including a high-performance 3D data viewer, some image registration tools, and a volumetric image annotation suite. The Java N5 already supports a number of backends, including the N5 filesystem format, HDF5, Google Cloud, and AWS (take a look here). It might make sense for a JVM implementation of the zarr file system format to take the form of an N5 backend (initially, at least) - that would potentially give all of those other tools access to zarr datasets for free, as well as saving you writing some of the higher-level boilerplate. That's if you're happy with the API, of course.
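To make the "N5 backend" suggestion concrete, here is a rough, untested sketch of the caller side of the N5 API; the container path and dataset name are hypothetical, and the exact constructors/overloads should be checked against the saalfeldlab javadocs:

```java
import net.imglib2.RandomAccessibleInterval;
import net.imglib2.type.numeric.integer.UnsignedByteType;
import org.janelia.saalfeldlab.n5.N5FSReader;
import org.janelia.saalfeldlab.n5.N5Reader;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;

public class N5BackendSketch {
    public static void main(String[] args) throws Exception {
        // Any backend (filesystem, HDF5, Google Cloud, AWS, or a future zarr
        // backend) is consumed through the same N5Reader interface.
        N5Reader n5 = new N5FSReader("/data/example.n5");   // hypothetical container

        // n5-imglib2 exposes a dataset as a lazily loaded ImgLib2 image, which is
        // what the viewers and processing tools consume.
        RandomAccessibleInterval<UnsignedByteType> img =
                N5Utils.open(n5, "/volume/raw");             // hypothetical dataset

        System.out.println("dimensions: " + img.numDimensions());
    }
}
```

A zarr reader implementing N5Reader would slot into this call site unchanged, which is the "for free" part.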
My feeling is that zarr has more momentum behind it and will have more impact in the future. Convergence would be great, but if the N5 tool ecosystem could get access to zarr file system arrays for free, that could also solve the problem.
I'd be really happy if Zarr and N5 converged on the same spec. It would make it much easier for people in this problem domain to collaborate more effectively on many other common challenges.
checking in here after a long gap!
I'm far along with a Zarr implementation in Scala, which will address the "JVM implementation" request here.
Some notes:
- it's in a branch that I am aggressively cleaning up atm; I'll send a link by Monday, but wanted to just mention now since other relevant discussions are ongoing.
- as one concrete use: I can directly convert HDF5 files to Zarr in "the cloud"
- currently: S3 or GCS (via Java NIO APIs; ABS doesn't have an NIO impl yet)
- AFAIK that's not otherwise possible today: h5py can't do direct cloud IO; various FUSE-based workarounds are brittle or missing features.
- @tomwhite added an NIO read-path to the netCDF Java lib, and that's what I use, along with my JVM Zarr impl, to do the conversion
- Incidentally, this Scala implementation will also provide a JavaScript implementation "for free", via scala.js
- I'm hoping to also compile it to native, via scala-native, but that's at least another 6 months out (other libraries need to support scala-native first)
Looking forward to sharing more info on this shortly!
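For context, the read half of such a conversion looks roughly like the following with netCDF-Java's standard API (shown against a local file; the cloud case goes through the NIO read path mentioned above, and the Zarr write side is omitted since the Scala library isn't published yet). The file and variable names are made up:

```java
import ucar.ma2.Array;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

public class Hdf5ReadSketch {
    public static void main(String[] args) throws Exception {
        // netCDF-Java reads HDF5 files through the same API it uses for netCDF.
        try (NetcdfFile ncfile = NetcdfFile.open("/data/example.h5")) {
            Variable var = ncfile.findVariable("matrix");  // hypothetical variable
            Array data = var.read();                       // reads the whole variable
            System.out.println("shape: " + java.util.Arrays.toString(data.getShape()));
            // A converter would then re-chunk `data` and write each chunk out
            // through whatever JVM Zarr implementation is in use.
        }
    }
}
```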
Excellent! I would love to build off of this work on the netCDF-Java side to provide an IOSP for Zarr (read Zarr into the Common Data Model). At that point, we could enable the THREDDS Data Server to serve data stored in Zarr :-)
Would you be open to that idea, and does the license permit such usage?
@lesserwhirls yea, it will be Apache-2.0 licensed, happy to have it feed into netCDF things!
It might be helpful/less painful for everyone if we get the changes made to netCDF-Java merged upstream. @tomwhite - would you be willing to contribute those changes?
@lesserwhirls, yes I'd be happy to. I'll open an issue/PR to discuss.
Hi @ryan-williams how is it going?
I'm far along with a Zarr implementation in Scala, which will address the "JVM implementation" request here.
Some notes:
- it's in a branch that I am aggressively cleaning up atm; I'll send a link by Monday, but wanted to just mention now since other relevant discussions are ongoing.
Hello! I've been side-tracked, but what I have is here: lasersonlab/ndarray.scala. It's pretty "alpha" still, and the issues reasonably capture the things I'm focused on next.
I'll be checking back in on this in the coming weeks, and will give some more updates here.
Just ran across https://github.com/bcdev/jzarr/blob/master/docs/tutorial.rst
cc: @SabineEmbacher
If you need array objects which behave almost like NumPy arrays, you can also wrap the data using an ND4J INDArray from deeplearning4j.org. You can find examples in the data writing and reading examples (see the sketch after these links).
https://jzarr.readthedocs.io/en/latest/tutorial.html#writing-and-reading-data
Or directly in the code example
https://github.com/bcdev/jzarr/blob/master/docs/examples/java/Tutorial_rtd.java#L41
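For example, wrapping the result of a jzarr read in an INDArray looks roughly like this; it's an untested sketch, the array path is made up, and the exact method names/overloads should be checked against the linked tutorial:

```java
import com.bc.zarr.ZarrArray;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class ZarrToNd4j {
    public static void main(String[] args) throws Exception {
        // Hypothetical local array containing float32 ("f4") data.
        ZarrArray zarr = ZarrArray.open("output/example.zarr");

        // Read the whole array: full shape starting at the zero offset.
        int[] shape = zarr.getShape();
        float[] flat = (float[]) zarr.read(shape, new int[shape.length]);

        // Nd4j copies the flat buffer into an INDArray with the given shape,
        // which then supports NumPy-like slicing, broadcasting, etc.
        INDArray nd = Nd4j.create(flat, shape);
        System.out.println(java.util.Arrays.toString(nd.shape()));
    }
}
```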
Can any of you tell me how to publish the jzarr Java library to the Maven Central repository? I've never done this before.
Do any of you have the time to guide or support me?
Best Regards
Sabine
Hi @SabineEmbacher. I don't remember what HOWTO we followed originally for our jars (cc: @sbesson) but https://stackoverflow.com/questions/28846802/how-to-manually-publish-jar-to-maven-central looks reasonable enough. The biggest hurdles I remember are (1) proving that you own your groupId (*.bc.com) and (2) making sure that all of your dependencies are accessible from Maven Central. I've created bcdev/jzarr#4 since this may become protracted, but certainly happy to help. ~Josh
Following-up on #15 (comment), the process used by OME for releasing some of its Java components to Sonatype is documented here with the relevant links to OSSRH in case it's useful. If possible, big 👍 for having jzarr available from Maven Central.
alimanfoo commented on 1 Aug 2018
JVM implementation of Zarr would be very cool, particularly if it had the same flexibility as the Python implementation to plug in different storage back-ends including cloud object stores.
Did you see the example of how to read and write to Amazon AWS S3 cloud storage using JZarr?
See:
https://jzarr.readthedocs.io/en/latest/amazonS3.html
and code example
https://github.com/bcdev/jzarr/blob/master/docs/examples/java/S3Array_nio.java
Completely missed this thread but wanted to mention that https://github.com/saalfeldlab/n5-zarr has implemented https://zarr.readthedocs.io/en/stable/spec/v2.html as an N5 backend since September 2019. This way it is available for array processing with ImgLib2 https://github.com/saalfeldlab/n5-imglib2, which has no size limits and built-in memory caching, and is also the native data library for BigDataViewer and a bunch of processing tools that we use and build. n5-zarr includes blosc compression and locking and is included in the standard distribution of https://fiji.sc/. With the N5 API, talking to Zarr, N5, and HDF5 is all the same.
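Reading a zarr container through n5-zarr looks roughly like this (an untested sketch; the container path and dataset name are made up), and swapping in N5FSReader or an HDF5 reader leaves the rest of the code unchanged:

```java
import net.imglib2.RandomAccessibleInterval;
import net.imglib2.type.numeric.real.FloatType;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;
import org.janelia.saalfeldlab.n5.zarr.N5ZarrReader;

public class N5ZarrSketch {
    public static void main(String[] args) throws Exception {
        // n5-zarr reads a zarr v2 container through the same N5Reader interface
        // used for N5, HDF5, and the cloud backends.
        N5ZarrReader zarr = new N5ZarrReader("/data/example.zarr");  // hypothetical path

        // Datasets come back as lazily loaded ImgLib2 images for downstream tools
        // (BigDataViewer, n5-imglib2-based processing, Fiji plugins, ...).
        RandomAccessibleInterval<FloatType> img =
                N5Utils.open(zarr, "/volume/raw");                    // hypothetical dataset

        System.out.println("dimensions: " + img.numDimensions());
    }
}
```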
There is currently no official cloud backend (other than through FS wrappers) for N5-Zarr because we haven't yet separated the interfaces for store and translation layers, i.e. writing a backend for HDF5 or Zarr is entangled with writing a backend for another store (like the AWS and GoogleCloud stores for N5). I remember that there was a fork that copied the n5-aws-s3 logic into n5-zarr as a temporary solution. @joshmoore, wasn't that you who did this?
I remember that there was a fork that copied the n5-aws-s3 logic into n5-zarr as a temporary solution. @joshmoore, wasn't that you who did this?
Yup, see saalfeldlab/n5-aws-s3#10 and saalfeldlab/n5-zarr#5
Yup. It then got copied into the bdv/mobie code base for @tischi's I2K work. Having a way to unblock all of that would be great. (Note: I only copied-n-pasted the reader side of things. Writing still needs work as far as I know.)
As with the Rust focus during the Feb. 10th meeting, the upcoming call this Wednesday may lean toward Java, if anyone is interested in joining to chat.
Thanks @joshmoore! I'll be there. Looking forward to seeing you all.