nasa/EMIT-Data-Resources

chunked reading of files

GMoncrieff opened this issue · 2 comments

Currently, reading files with emit_xarray from emit_tools.py loads the data into an np.ndarray-backed xr.Dataset. An option to read into a chunked dask.array-backed xr.Dataset would help prevent out-of-memory errors on machines with limited memory (loading failed on an 8 GB SMCE machine) and could potentially speed up downstream operations using dask.

Adding chunks='auto' to

```python
ds = xr.open_dataset(filepath, engine=engine)
```

works when ortho=False, but not when ortho=True.
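For reference, a minimal sketch of what a chunked read could look like (assuming the spatial dims are named downtrack and crosstrack as in emit_tools.py; the chunk sizes are only illustrative):

```python
import xarray as xr

# Lazily open the file with dask-backed arrays instead of loading it eagerly.
# Chunk only the spatial dims and keep the full spectrum in each chunk, so
# per-pixel spectral operations do not trigger extra reads.
ds = xr.open_dataset(
    filepath,
    engine=engine,
    chunks={"downtrack": 512, "crosstrack": 512, "bands": -1},
)
```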

ebolch commented

I will look into this some more. When orthorectifying, the geometry lookup table (GLT) included in the location group is used to reshape the data array to fit the GLT dimensions. I think doing that on chunks would require modifying how the GLT is applied. My understanding is that chunking along the bands dimension would likely slow down operations focused on imaging spectroscopy, since that would require more reads. Maybe @pgbrodrick has some insight?
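For context, a rough sketch of how a GLT is typically applied (illustrative only, not the exact emit_tools.py implementation): the GLT stores 1-based source indices with 0 marking fill pixels, and the whole cube is fancy-indexed into the output grid, which is why it does not translate cleanly to chunked arrays.

```python
import numpy as np

def apply_glt(data, glt_x, glt_y, fill_value=-9999.0):
    # data: (downtrack, crosstrack, bands); glt_x / glt_y: 1-based indices, 0 = fill.
    out = np.full(glt_x.shape + (data.shape[-1],), fill_value, dtype=data.dtype)
    valid = glt_x > 0
    # Fancy indexing can pull pixels from anywhere in the cube, so the full
    # array effectively has to be addressable at once.
    out[valid] = data[glt_y[valid] - 1, glt_x[valid] - 1, :]
    return out
```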

At minimum, we can add notes about memory requirements.

GMoncrieff commented

Yes, I tried fiddling with the orthorectification but could not get it to work with dask arrays. In the version of emit_xarray I am working with in my project (utils/emit_tools.py), I added an option to chunk along the crosstrack and downtrack dims that is only accepted if ortho=False.

I think this is worthwhile because the recommended workflow is to perform data manipulations and biophysical modelling on the unorthorectified data, and to orthorectify the derived variables later. This works for me because the downstream output typically has far fewer variables than the image cube has bands (I go from 250-ish bands to 4 endmembers), so it becomes much more feasible to load the data into memory before orthorectifying.
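Roughly, that workflow looks like the sketch below. The chunks argument is the proposed option (not an existing one), and unmix / orthorectify are placeholders for the user's biophysical model and for the GLT-based reprojection step:

```python
# Proposed: open unorthorectified data lazily with dask chunks.
ds = emit_xarray(filepath, ortho=False,
                 chunks={"downtrack": 512, "crosstrack": 512})

# Lazy reduction from hundreds of bands to a handful of derived variables.
abund = unmix(ds["reflectance"])   # placeholder for the biophysical model

# The derived product is small, so load it and orthorectify it last.
abund = abund.compute()
abund_ortho = orthorectify(abund)  # placeholder for the GLT-based step
```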

My guess is that this workflow is common enough to warrant accommodating it by adding a chunking option when ortho=False.