Bug: Unable to specify list for urlpath with NetCDFSource + fsspec
pbranson opened this issue · 6 comments
I have encountered a bug with a recent change in either intake-xarray or fsspec where passing a list for the urlpath to intake-xarray fails. This worked previously. Sorry havent had time to dig further, but thought I would log it
A minimum reproducible example is:
import intake
from intake.catalog import Catalog
from intake.catalog.local import LocalCatalogEntry
mycat = Catalog.from_dict({
'eta': LocalCatalogEntry('test', 'test multi file', 'netcdf', args={'urlpath':['https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/version_3.3/forecast/latest/ocean_fc_2021011512_000_eta.nc','https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/version_3.3/forecast/latest/ocean_fc_2021011512_024_eta.nc'],
'concat_dim':'Time',
'combine':'nested'}),
})
print(mycat.eta.yaml())
mycat.eta.to_dask()
Which outputs:
sources:
test:
args:
combine: nested
concat_dim: Time
urlpath:
- https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/version_3.3/forecast/latest/ocean_fc_2021011512_000_eta.nc
- https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/version_3.3/forecast/latest/ocean_fc_2021011512_024_eta.nc
description: test multi file
driver: intake_xarray.netcdf.NetCDFSource
metadata:
catalog_dir: ''
AttributeError Traceback (most recent call last)
<ipython-input-37-f969d7f0cd32> in <module>
7 })
8 print(mycat.eta.yaml())
----> 9 mycat.eta.to_dask()
~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake_xarray/base.py in to_dask(self)
67 def to_dask(self):
68 """Return xarray object where variables are dask arrays"""
---> 69 return self.read_chunked()
70
71 def close(self):
~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake_xarray/base.py in read_chunked(self)
42 def read_chunked(self):
43 """Return xarray object (which will have chunks)"""
---> 44 self._load_metadata()
45 return self._ds
46
~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
124 """load metadata only if needed"""
125 if self._schema is None:
--> 126 self._schema = self._get_schema()
127 self.datashape = self._schema.datashape
128 self.dtype = self._schema.dtype
~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake_xarray/base.py in _get_schema(self)
16
17 if self._ds is None:
---> 18 self._open_dataset()
19
20 metadata = {
~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake_xarray/netcdf.py in _open_dataset(self)
88 else:
89 # https://github.com/intake/filesystem_spec/issues/476#issuecomment-732372918
---> 90 url = fsspec.open(self.urlpath, **self.storage_options).open()
91
92 self._ds = _open_dataset(url, chunks=self.chunks, **kwargs)
~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/core.py in open(urlpath, mode, compression, encoding, errors, protocol, newline, **kwargs)
436 newline=newline,
437 expand=False,
--> 438 **kwargs
439 )[0]
440
~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/core.py in open_files(urlpath, mode, compression, encoding, errors, name_function, num, protocol, newline, auto_mkdir, expand, **kwargs)
285 storage_options=kwargs,
286 protocol=protocol,
--> 287 expand=expand,
288 )
289 if "r" not in mode and auto_mkdir:
~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol, expand)
600 cls = get_filesystem_class(protocol)
601 optionss = list(map(cls._get_kwargs_from_urls, urlpath))
--> 602 paths = [cls._strip_protocol(u) for u in urlpath]
603 options = optionss[0]
604 if not all(o == options for o in optionss):
~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/core.py in <listcomp>(.0)
600 cls = get_filesystem_class(protocol)
601 optionss = list(map(cls._get_kwargs_from_urls, urlpath))
--> 602 paths = [cls._strip_protocol(u) for u in urlpath]
603 options = optionss[0]
604 if not all(o == options for o in optionss):
~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/implementations/local.py in _strip_protocol(cls, path)
145 def _strip_protocol(cls, path):
146 path = stringify_path(path)
--> 147 if path.startswith("file://"):
148 path = path[7:]
149 path = os.path.expanduser(path)
AttributeError: 'list' object has no attribute 'startswith'
Relevant library versions:
aiohttp==3.7.3
intake==0.6.0
intake-xarray==0.4.1
fsspec==0.8.5
xarray==0.16.2
thanks for the detailed reports @pbranson looks like we should definitely add some additional tests in this library to try and catch these issues.
I think an open question right now is whether fsspec or xarray should handle the loading. Hopefully @martindurant can provide some guidance here. With some backend refactoring in xarray underway i'm not sure of the best way forward as it seems there are incompatibilities with how the data is stored (Zarr, netcdf, tiff) and what code ultimately handles I/O (zarr, h5netcdf, rasterio) given either URLs or fsspec objects. See pydata/xarray#4823 (comment).
xarray should do this in the long term, but the storage backend implementation has changed a lot. The PR that is still not merged now only deals with the zarr engine, precisely because it's not clear which engine can accept what kind of input, and there are no tests for any of it.
Yeah I'm not sure the best solution here.
It would seem that the solution would be to revert the NetCDFSource to defer to XArray, but where does that leave catalogs of netCDF files accessed from a http server.
Maybe could add an engine argument that passes urlpath as str for engine='netcdf'?
Or add some of the features (like a list or pattern of urls) that are in NetCDFSource to OpenDAPSource?
intake_thredds
handles this
import intake_thredds
intake_thredds.__version__ # '2021.6.16'
intake_thredds.THREDDSMergedSource(url='simplecache::https://dapds00.nci.org.au/thredds/catalog/rr6/oceanmaps_datasets/version_3.3/analysis/eta/catalog.xml',
path=['ocean_an00_2020052*12_eta.nc'],
driver='netcdf').to_dask()
<xarray.Dataset>
Dimensions: (xt_ocean: 3600, yt_ocean: 1500, Time: 9, nv: 2)
Coordinates:
* xt_ocean (xt_ocean) float64 0.05 0.15 0.25 0.35 ... 359.8 359.9 360.0
* yt_ocean (yt_ocean) float64 -74.95 -74.85 -74.75 ... 74.75 74.85 74.95
* Time (Time) datetime64[ns] 2020-05-22 2020-05-23 ... 2020-05-30
* nv (nv) float64 1.0 2.0
Data variables:
eta_t (Time, yt_ocean, xt_ocean) float32 dask.array<chunksize=(1, 1500, 3600), meta=np.ndarray>
average_T1 (Time) datetime64[ns] dask.array<chunksize=(1,), meta=np.ndarray>
average_T2 (Time) datetime64[ns] dask.array<chunksize=(1,), meta=np.ndarray>
average_DT (Time) timedelta64[ns] dask.array<chunksize=(1,), meta=np.ndarray>
Time_bounds (Time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
@pbranson is this still an issue for you?
Thanks @aaronspring yes this isnt so much a problem now that intake-thredds has been developed further.