pangeo-data/pangeo-stacks

loading data on HPC using intake catalog

cpatrizio88 opened this issue · 6 comments

Hello,

I have been experiencing an issue when trying to load data using an intake catalog. I've successfully loaded the catalog as shown below:

import intake

cat = intake.open_catalog('catalog.yaml')
list(cat)

['sea_surface_height',
 'ECCOv4r3',
 'SOSE',
 'LLC4320_grid',
 'LLC4320_SST',
 'LLC4320_SSS',
 'LLC4320_SSH',
 'LLC4320_SSU',
 'LLC4320_SSV',
 'CESM_POP_hires_control',
 'CESM_POP_hires_RCP8_5',
 'GFDL_CM2_6_control_ocean_surface',
 'GFDL_CM2_6_control_ocean_3D',
 'GFDL_CM2_6_one_percent_ocean_surface',
 'GFDL_CM2_6_one_percent_ocean_3D',
 'GFDL_CM2_6_grid']

However, when I try to load any of the datasets, I get the following error:

ds = cat.ECCOv4r3.to_dask()
ds

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-12-99add3ca0ed7> in <module>
----> 1 ds = cat.ECCOv4r3.to_dask()
      2 ds

~/miniconda3/envs/pangeo/lib/python3.7/site-packages/intake_xarray/base.py in to_dask(self)
     68     def to_dask(self):
     69         """Return xarray object where variables are dask arrays"""
---> 70         return self.read_chunked()
     71 
     72     def close(self):

~/miniconda3/envs/pangeo/lib/python3.7/site-packages/intake_xarray/base.py in read_chunked(self)
     43     def read_chunked(self):
     44         """Return xarray object (which will have chunks)"""
---> 45         self._load_metadata()
     46         return self._ds
     47 

~/miniconda3/envs/pangeo/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
    115         """load metadata only if needed"""
    116         if self._schema is None:
--> 117             self._schema = self._get_schema()
    118             self.datashape = self._schema.datashape
    119             self.dtype = self._schema.dtype

~/miniconda3/envs/pangeo/lib/python3.7/site-packages/intake_xarray/base.py in _get_schema(self)
     17 
     18         if self._ds is None:
---> 19             self._open_dataset()
     20 
     21             metadata = {

~/miniconda3/envs/pangeo/lib/python3.7/site-packages/intake_xarray/xzarr.py in _open_dataset(self)
     30         update_storage_options(options, self.storage_options)
     31 
---> 32         self._fs, _ = get_fs(protocol, options)
     33         if protocol != 'file':
     34             self._mapper = get_mapper(protocol, self._fs, urlpath)

~/miniconda3/envs/pangeo/lib/python3.7/site-packages/dask/bytes/core.py in get_fs(protocol, storage_options)
    569             "    pip install gcsfs",
    570         )
--> 571         cls = _filesystems[protocol]
    572 
    573     elif protocol in ["adl", "adlfs"]:

KeyError: 'gcs'

It appears to be an error related to Google Cloud Storage. For reference, I am using the base Pangeo environment provided here: https://github.com/pangeo-data/pangeo-stacks/blob/master/pangeo-notebook/binder/environment.yml

ping @martindurant on this one.

gcsfs has been updated to use fsspec and released, but dask has not been yet. Please either install dask from master, or use an older version of gcsfs.
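Something along these lines should work (a sketch; the exact version pin is an assumption, so use whichever gcsfs release predates its fsspec migration):

pip install git+https://github.com/dask/dask
# or, alternatively, pin gcsfs to a pre-fsspec release
pip install "gcsfs<0.3"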

Thank you @martindurant, that seemed to do the trick for the KeyError; however, there is a different issue now. When loading the dataset, it eventually gives me this error (after running for some time):

ds = cat.SOSE.to_dask()
ds

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
~/miniconda3/envs/pangeo/lib/python3.7/site-packages/urllib3/connection.py in _new_conn(self)
    159             conn = connection.create_connection(
--> 160                 (self._dns_host, self.port), self.timeout, **extra_kw)
    161 

~/miniconda3/envs/pangeo/lib/python3.7/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
     79     if err is not None:
---> 80         raise err
     81 

~/miniconda3/envs/pangeo/lib/python3.7/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
     69                 sock.bind(source_address)
---> 70             sock.connect(sa)
     71             return sock

OSError: [Errno 101] Network is unreachable

I should mention that my Jupyter notebook was launched from an HPC system (Cheyenne).

That sounds like something happening at a far deeper level than the one Intake works at.
You could debug and walk up the stack to find exactly which call is failing. You may need to pass an auth mechanism to gcsfs via the storage_options parameter; it may be worth importing gcsfs directly and checking that you can successfully connect to the target data.
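For instance, a minimal connectivity check could look like this (the bucket name is just a placeholder, and token='anon' assumes the data is publicly readable):

import gcsfs

fs = gcsfs.GCSFileSystem(token='anon')  # anonymous access for public buckets
fs.ls('pangeo-data')  # placeholder bucket; listing it confirms connectivity

If anonymous access fails, pass credentials via token= instead (e.g. a path to a service-account JSON file, or 'cloud' when running on GCP), and mirror whatever works in the catalog's storage_options.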

@cpatrizio88 - the Cheyenne compute nodes do not have outside network access, so they won't be able to reach datasets stored on GCS.

@jhamman ah, that's what I suspected... that is unfortunate! Well, thanks anyway. I will have to get the data transferred to the GLADE file system then.
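Once it's on GLADE, I expect opening it locally will look something like this (the path is hypothetical and depends on where the zarr store ends up):

import xarray as xr

ds = xr.open_zarr('/glade/scratch/username/SOSE')  # hypothetical local copy of the zarr store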