intake/intake-xarray

Bug: Unable to specify list for urlpath with NetCDFSource + fsspec

pbranson opened this issue · 6 comments

I have encountered a bug with a recent change in either intake-xarray or fsspec where passing a list for the urlpath to intake-xarray fails. This worked previously. Sorry havent had time to dig further, but thought I would log it

A minimum reproducible example is:

import intake
from intake.catalog import Catalog
from intake.catalog.local import LocalCatalogEntry
mycat = Catalog.from_dict({
    'eta': LocalCatalogEntry('test', 'test multi file', 'netcdf', args={'urlpath':['https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/version_3.3/forecast/latest/ocean_fc_2021011512_000_eta.nc','https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/version_3.3/forecast/latest/ocean_fc_2021011512_024_eta.nc'],
                                                                            'concat_dim':'Time',
                                                                            'combine':'nested'}),
    })
print(mycat.eta.yaml())
mycat.eta.to_dask()

Which outputs:

sources:
  test:
    args:
      combine: nested
      concat_dim: Time
      urlpath:
      - https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/version_3.3/forecast/latest/ocean_fc_2021011512_000_eta.nc
      - https://dapds00.nci.org.au/thredds/dodsC/rr6/oceanmaps_datasets/version_3.3/forecast/latest/ocean_fc_2021011512_024_eta.nc
    description: test multi file
    driver: intake_xarray.netcdf.NetCDFSource
    metadata:
      catalog_dir: ''


AttributeError                            Traceback (most recent call last)
<ipython-input-37-f969d7f0cd32> in <module>
      7     })
      8 print(mycat.eta.yaml())
----> 9 mycat.eta.to_dask()

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake_xarray/base.py in to_dask(self)
     67     def to_dask(self):
     68         """Return xarray object where variables are dask arrays"""
---> 69         return self.read_chunked()
     70 
     71     def close(self):

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake_xarray/base.py in read_chunked(self)
     42     def read_chunked(self):
     43         """Return xarray object (which will have chunks)"""
---> 44         self._load_metadata()
     45         return self._ds
     46 

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
    124         """load metadata only if needed"""
    125         if self._schema is None:
--> 126             self._schema = self._get_schema()
    127             self.datashape = self._schema.datashape
    128             self.dtype = self._schema.dtype

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake_xarray/base.py in _get_schema(self)
     16 
     17         if self._ds is None:
---> 18             self._open_dataset()
     19 
     20             metadata = {

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/intake_xarray/netcdf.py in _open_dataset(self)
     88         else:
     89             # https://github.com/intake/filesystem_spec/issues/476#issuecomment-732372918
---> 90             url = fsspec.open(self.urlpath, **self.storage_options).open()
     91 
     92         self._ds = _open_dataset(url, chunks=self.chunks, **kwargs)

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/core.py in open(urlpath, mode, compression, encoding, errors, protocol, newline, **kwargs)
    436         newline=newline,
    437         expand=False,
--> 438         **kwargs
    439     )[0]
    440 

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/core.py in open_files(urlpath, mode, compression, encoding, errors, name_function, num, protocol, newline, auto_mkdir, expand, **kwargs)
    285         storage_options=kwargs,
    286         protocol=protocol,
--> 287         expand=expand,
    288     )
    289     if "r" not in mode and auto_mkdir:

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol, expand)
    600         cls = get_filesystem_class(protocol)
    601         optionss = list(map(cls._get_kwargs_from_urls, urlpath))
--> 602         paths = [cls._strip_protocol(u) for u in urlpath]
    603         options = optionss[0]
    604         if not all(o == options for o in optionss):

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/core.py in <listcomp>(.0)
    600         cls = get_filesystem_class(protocol)
    601         optionss = list(map(cls._get_kwargs_from_urls, urlpath))
--> 602         paths = [cls._strip_protocol(u) for u in urlpath]
    603         options = optionss[0]
    604         if not all(o == options for o in optionss):

~/miniconda3/envs/roampy-dev/lib/python3.7/site-packages/fsspec/implementations/local.py in _strip_protocol(cls, path)
    145     def _strip_protocol(cls, path):
    146         path = stringify_path(path)
--> 147         if path.startswith("file://"):
    148             path = path[7:]
    149         path = os.path.expanduser(path)

AttributeError: 'list' object has no attribute 'startswith'

Relevant library versions:

aiohttp==3.7.3
intake==0.6.0
intake-xarray==0.4.1
fsspec==0.8.5
xarray==0.16.2

cc @martindurant @scottyhq

thanks for the detailed reports @pbranson looks like we should definitely add some additional tests in this library to try and catch these issues.

I think an open question right now is whether fsspec or xarray should handle the loading. Hopefully @martindurant can provide some guidance here. With some backend refactoring in xarray underway i'm not sure of the best way forward as it seems there are incompatibilities with how the data is stored (Zarr, netcdf, tiff) and what code ultimately handles I/O (zarr, h5netcdf, rasterio) given either URLs or fsspec objects. See pydata/xarray#4823 (comment).

xarray should do this in the long term, but the storage backend implementation has changed a lot. The PR that is still not merged now only deals with the zarr engine, precisely because it's not clear which engine can accept what kind of input, and there are no tests for any of it.

Yeah I'm not sure the best solution here.

It would seem that the solution would be to revert the NetCDFSource to defer to XArray, but where does that leave catalogs of netCDF files accessed from a http server.

Maybe could add an engine argument that passes urlpath as str for engine='netcdf'?

Or add some of the features (like a list or pattern of urls) that are in NetCDFSource to OpenDAPSource?

intake_thredds handles this

import intake_thredds
intake_thredds.__version__ # '2021.6.16'

intake_thredds.THREDDSMergedSource(url='simplecache::https://dapds00.nci.org.au/thredds/catalog/rr6/oceanmaps_datasets/version_3.3/analysis/eta/catalog.xml',
                                  path=['ocean_an00_2020052*12_eta.nc'],
                                  driver='netcdf').to_dask()
<xarray.Dataset>
Dimensions:      (xt_ocean: 3600, yt_ocean: 1500, Time: 9, nv: 2)
Coordinates:
  * xt_ocean     (xt_ocean) float64 0.05 0.15 0.25 0.35 ... 359.8 359.9 360.0
  * yt_ocean     (yt_ocean) float64 -74.95 -74.85 -74.75 ... 74.75 74.85 74.95
  * Time         (Time) datetime64[ns] 2020-05-22 2020-05-23 ... 2020-05-30
  * nv           (nv) float64 1.0 2.0
Data variables:
    eta_t        (Time, yt_ocean, xt_ocean) float32 dask.array<chunksize=(1, 1500, 3600), meta=np.ndarray>
    average_T1   (Time) datetime64[ns] dask.array<chunksize=(1,), meta=np.ndarray>
    average_T2   (Time) datetime64[ns] dask.array<chunksize=(1,), meta=np.ndarray>
    average_DT   (Time) timedelta64[ns] dask.array<chunksize=(1,), meta=np.ndarray>
    Time_bounds  (Time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>

@pbranson is this still an issue for you?

Thanks @aaronspring yes this isnt so much a problem now that intake-thredds has been developed further.