intake/intake-xarray

`to_dask()` not lazy when `simplecache::` in urlpath

aaronspring opened this issue · 1 comments

When loading with `to_dask()` and caching as in pangeo-data/pangeo-datastore#113, `fsspec.open_local` first downloads the whole dataset and only then opens the data in xarray. The result is still chunked, but only after the time has already been spent on downloading.

Is there a way to circumvent this in intake-xarray, or is this a consequence of fsspec caching that cannot be changed from within intake-xarray?

It would be great to just call `to_dask()` without spending the time to download, and only cache when xarray runs compute.

Whilst this may be possible, it would be tricky. Dask wants to open the file to assess the chunking; in theory, that could be done on the original remote file, with caching happening only when the data is actually loaded. There is a block-wise cacher in fsspec, which only downloads the parts of a file that are accessed, as they are accessed, but that only works with a library that accepts Python file-like objects (i.e., there's a reason to call `open_local`: the library wants a real local file). You could do something with FUSE, where the file looks real to the OS but is cached block-wise internally. I'm pretty sure this kind of thing has never been tried.
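To make the block-wise idea concrete, here is a minimal standard-library sketch of a file-like object that only fetches the blocks a reader touches. It is not fsspec's implementation: `BlockCachedFile`, `fetch_range`, and the in-memory `REMOTE` bytes are hypothetical names made up for illustration, with `fetch_range` standing in for a ranged request against remote storage.

```python
class BlockCachedFile:
    """Hypothetical file-like object that fetches fixed-size blocks on demand."""

    def __init__(self, fetch_range, size, block_size):
        self._fetch_range = fetch_range  # callable (start, end) -> bytes
        self._size = size
        self._block_size = block_size
        self._blocks = {}  # block index -> cached bytes
        self._pos = 0

    def seek(self, offset, whence=0):
        # whence follows the usual convention: 0 = start, 1 = current, 2 = end
        base = {0: 0, 1: self._pos, 2: self._size}[whence]
        self._pos = max(0, base + offset)
        return self._pos

    def tell(self):
        return self._pos

    def _block(self, index):
        # Fetch a block only the first time it is needed.
        if index not in self._blocks:
            start = index * self._block_size
            end = min(start + self._block_size, self._size)
            self._blocks[index] = self._fetch_range(start, end)
        return self._blocks[index]

    def read(self, n=-1):
        # Clamp the request to the bytes that remain in the file.
        if n < 0 or self._pos + n > self._size:
            n = max(0, self._size - self._pos)
        end = self._pos + n
        out = bytearray()
        while self._pos < end:
            index, offset = divmod(self._pos, self._block_size)
            chunk = self._block(index)[offset:offset + (end - self._pos)]
            out += chunk
            self._pos += len(chunk)
        return bytes(out)


# Stand-in for a remote file; every "download" is recorded in `fetched`.
REMOTE = bytes(range(32))
fetched = []


def fetch_range(start, end):
    fetched.append((start, end))
    return REMOTE[start:end]


f = BlockCachedFile(fetch_range, size=len(REMOTE), block_size=8)
f.seek(10)
first = f.read(4)   # touches only the block covering bytes 8..16
f.seek(12)
second = f.read(2)  # served from the cached block, no new fetch
```

A reader that accepts a Python file object can work against something like this, and only the touched blocks are ever transferred. A library that insists on a path to a real local file cannot, which is why `open_local` with `simplecache::` has to download the whole file first.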