trhallam/segysak

`fill_cdpna()` causes crash in output

aadm opened this issue · 2 comments

aadm commented

With input data read from netcdf via dask, applying fill_cdpna() causes an error when writing the file back to netcdf.

One solution is to 'realize' the data and then write it to disk, e.g.:

cube.compute().seisio.to_netcdf('cube_crop.seisnc')

But this is not applicable if the data size is very large (it would crash simply because there is not enough RAM to hold the data in memory).
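As a rough way to gauge whether that in-memory workaround is even feasible, one could compare the dataset's size with the available memory before calling compute() (a minimal sketch; psutil and the 0.5 safety factor are assumptions here, not part of segysak):

import psutil

# Rough feasibility check: only realise the cube in memory if it fits
# comfortably within the available RAM (the 0.5 factor is arbitrary).
if cube.nbytes < 0.5 * psutil.virtual_memory().available:
    cube.compute().seisio.to_netcdf('cube_crop.seisnc')
else:
    print('Cube too large to realise in memory; a lazy write path is needed.')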

Example:

cube = open_seisnc(local_data+'cube.seisnc', chunks={'xline': 100, 'iline': 100})
cube.seis.fill_cdpna()
cube = cube.sel(twt=slice(3800,5500))
cube.seisio.to_netcdf('cube_crop.seisnc')

Resulting error log copied from the Jupyter console:

In [7]: cube.seisio.to_netcdf('cube_crop.seisnc')
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-0f071262ca7b> in <module>
----> 1 near.seisio.to_netcdf(local_data+'md4_near_cond_crop.seisnc')
~/miniconda3/envs/wrk/lib/python3.7/site-packages/segysak/_accessor.py in to_netcdf(self, seisnc, **kwargs)
     46         kwargs["engine"] = "h5netcdf"
     47 
---> 48         self._obj.to_netcdf(seisnc, **kwargs)
     49 
     50 
~/miniconda3/envs/wrk/lib/python3.7/site-packages/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1566             unlimited_dims=unlimited_dims,
   1567             compute=compute,
-> 1568             invalid_netcdf=invalid_netcdf,
   1569         )
   1570 
~/miniconda3/envs/wrk/lib/python3.7/site-packages/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
   1080         # to be parallelized with dask
   1081         dump_to_store(
-> 1082             dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
   1083         )
   1084         if autoclose:
~/miniconda3/envs/wrk/lib/python3.7/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   1126         variables, attrs = encoder(variables, attrs)
   1127 
-> 1128     store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
   1129 
   1130 
~/miniconda3/envs/wrk/lib/python3.7/site-packages/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    296         self.set_dimensions(variables, unlimited_dims=unlimited_dims)
    297         self.set_variables(
--> 298             variables, check_encoding_set, writer, unlimited_dims=unlimited_dims
    299         )
    300 
~/miniconda3/envs/wrk/lib/python3.7/site-packages/xarray/backends/common.py in set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
    334             check = vn in check_encoding_set
    335             target, source = self.prepare_variable(
--> 336                 name, v, check, unlimited_dims=unlimited_dims
    337             )
    338 
~/miniconda3/envs/wrk/lib/python3.7/site-packages/xarray/backends/h5netcdf_.py in prepare_variable(self, name, variable, check_encoding, unlimited_dims)
    287                 dimensions=variable.dims,
    288                 fillvalue=fillvalue,
--> 289                 **kwargs,
    290             )
    291         else:
~/miniconda3/envs/wrk/lib/python3.7/site-packages/h5netcdf/core.py in create_variable(self, name, dimensions, dtype, data, fillvalue, **kwargs)
    499             group = group._require_child_group(k)
    500         return group._create_child_variable(keys[-1], dimensions, dtype, data,
--> 501                                             fillvalue, **kwargs)
    502 
    503     def _get_child(self, key):
~/miniconda3/envs/wrk/lib/python3.7/site-packages/h5netcdf/core.py in _create_child_variable(self, name, dimensions, dtype, data, fillvalue, **kwargs)
    474             h5ds = self._h5group[name]
    475             if _netcdf_dimension_but_not_variable(h5ds):
--> 476                 self._detach_dim_scale(name)
    477                 del self._h5group[name]
    478 
~/miniconda3/envs/wrk/lib/python3.7/site-packages/h5netcdf/core.py in _detach_dim_scale(self, name)
    572             for n, dim in enumerate(var.dimensions):
    573                 if dim == name:
--> 574                     var._h5ds.dims[n].detach_scale(self._all_h5groups[dim])
    575 
    576         for subgroup in self.groups.values():
~/miniconda3/envs/wrk/lib/python3.7/site-packages/h5py/_hl/dims.py in detach_scale(self, dset)
    101         """
    102         with phil:
--> 103             h5ds.detach_scale(self._id, dset.id, self._dimension)
    104 
    105     def items(self):
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5ds.pyx in h5py.h5ds.detach_scale()
h5py/defs.pyx in h5py.defs.H5DSdetach_scale()
RuntimeError: Unspecified error in H5DSdetach_scale (return value <0)

Thanks @aadm, I'm not sure if this can be fixed but I'll have a look. An alternative option for us that is probably simpler is to allow fill_cdpna() to be run during the conversion process, so that it would already be implicit in the NetCDF4 file. It is a little frustrating that it is not super easy to modify parts of a NetCDF4 file on disk.
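For context, fill_cdpna() fills missing (NaN) cdp_x/cdp_y coordinate values from the survey geometry. A rough conceptual sketch of that idea, not segysak's actual implementation, is to fit an affine map from (iline, xline) to (cdp_x, cdp_y) on the traces where the coordinates are known and evaluate it where they are NaN:

import numpy as np
import xarray as xr

def fill_cdp_coords(ds: xr.Dataset) -> xr.Dataset:
    # Conceptual sketch only; use segysak's fill_cdpna() in practice.
    # Assumes cdp_x/cdp_y are numpy-backed 2D variables with dims (iline, xline).
    il, xl = np.meshgrid(ds.iline.values, ds.xline.values, indexing='ij')
    design = np.column_stack([il.ravel(), xl.ravel(), np.ones(il.size)])
    for name in ('cdp_x', 'cdp_y'):
        vals = np.asarray(ds[name].values, dtype=float).ravel()
        known = ~np.isnan(vals)
        # Least-squares fit of coord ~ a*iline + b*xline + c on known traces,
        # then evaluate the fitted plane where the coordinate is missing.
        coef, *_ = np.linalg.lstsq(design[known], vals[known], rcond=None)
        vals[~known] = design[~known] @ coef
        ds = ds.assign_coords({name: (ds[name].dims, vals.reshape(ds[name].shape))})
    return ds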

One option I might get you to try is to open the file without using dask (this won't load it straight away), run fill_cdpna(), and then use the to_netcdf() method with mode='a' to append the modified variables. Then try your cropping operation using dask.

So your example would be:

cube = open_seisnc(local_data+'cube.seisnc')
cube.seis.fill_cdpna()
cube.seisio.to_netcdf(local_data+'cube.seisnc', mode='a') # Note: this will modify the file, so you might want to back it up while testing.

cube = open_seisnc(local_data+'cube.seisnc', chunks={'xline': 100, 'iline': 100})
cube = cube.sel(twt=slice(3800,5500))
cube.seisio.to_netcdf('cube_crop.seisnc')

This has been addressed in v0.5 with lazy loading support and a reversion to xarray's native netcdf methods.

Note: please use the new accessor namespace method .segysak.fill_cdpna().

Note: the current release candidate has a bug, but the actual release will fix a uint downcasting issue that isn't compatible with xr.to_netcdf.
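For reference, under v0.5+ the original example might look something like the following (a sketch assuming the seisnc file can be opened with xarray's native NetCDF reader; paths and chunk sizes are illustrative):

import xarray as xr

# Open the SEISNC (NetCDF) file lazily with xarray's native reader.
cube = xr.open_dataset('cube.seisnc', chunks={'iline': 100, 'xline': 100})

# Fill the CDP coordinates via the new accessor namespace. Depending on the
# segysak release this may operate in place or return a filled dataset, so
# check the accessor documentation for your version.
cube.segysak.fill_cdpna()

# Crop and write back using xarray's native netcdf writer.
cube = cube.sel(twt=slice(3800, 5500))
cube.to_netcdf('cube_crop.seisnc')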