Extension to the HDF5 chunks API

Question

Closed this issue 4 months ago · 0 comments

Currently (v1.11.1.0), support for HDF5 chunking is limited in several ways:

Chunking can only be set on a per-Data-object basis
Chunking can only be defined by explicitly setting the chunk shape on each axis
Chunking is ignored in an output file unless native compression is on
Chunks from an input file are not stored

A more comprehensive and flexible API is needed:

cfdm.write should chunk by default, and have a keyword argument (hdf5_chunks) to configure the default chunking.
cfdm.read should, by default, store the HDF5 chunking on the returned data, so that it will be used when writing out to a new netCDF4 file.
Setting an HDF5 chunking strategy should be more intuitive. E.g. it should be easy to "chunk the time axis by 12 elements, leaving all other axes unchunked": f.nc_set_hdf_chunksizes({'T': 12})
Setting HDF5 chunksizes should follow the Dask API for defining its computational chunk sizes. E.g. f.nc_set_hdf_chunksizes("8 MiB")
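
To make the last two points concrete, here is a rough sketch of how a per-axis dictionary and a byte-size string could each be resolved to a full chunk shape. The helper names (parse_size, chunks_from_dict, chunks_from_size) are hypothetical and are not part of cfdm. The halving strategy is just one simple way to meet a size limit, not necessarily what Dask or the eventual PR does.

```python
import math
import re

# Hypothetical helpers for illustration only; none of these are cfdm functions.

_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}


def parse_size(text):
    """Convert a string such as "8 MiB" to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]iB|B)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


def chunks_from_dict(shape, axes, spec):
    """Chunk only the axes named in spec; all other axes stay unchunked."""
    unknown = set(spec) - set(axes)
    if unknown:
        raise ValueError(f"unknown axes: {sorted(unknown)}")
    # A requested chunk length is capped at the axis length.
    return tuple(min(spec.get(axis, n), n) for axis, n in zip(axes, shape))


def chunks_from_size(shape, itemsize, limit):
    """Halve the longest axis repeatedly until one chunk fits in limit bytes."""
    max_elements = max(limit // itemsize, 1)
    chunks = list(shape)
    while math.prod(chunks) > max_elements:
        i = chunks.index(max(chunks))
        chunks[i] = math.ceil(chunks[i] / 2)
    return tuple(chunks)


# Example: a (time, latitude, longitude) array of 8-byte floats.
axes = ("T", "Y", "X")
shape = (1200, 73, 96)

by_axis = chunks_from_dict(shape, axes, {"T": 12})
by_size = chunks_from_size(shape, itemsize=8, limit=parse_size("8 MiB"))
print(by_axis, by_size)
```

With the dictionary form only the named axis is split, which matches the "chunk the time axis by 12 elements" case. The size form leaves the choice of shape to the library, with the caller stating only an upper bound on the bytes per chunk.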

PR to follow.