Loading data on HPC using an intake catalog
cpatrizio88 opened this issue · 6 comments
Hello,
I have been experiencing an issue when trying to load data using an intake catalog. I've successfully loaded the catalog as shown below:
cat = intake.open_catalog('catalog.yaml')
list(cat)
['sea_surface_height',
'ECCOv4r3',
'SOSE',
'LLC4320_grid',
'LLC4320_SST',
'LLC4320_SSS',
'LLC4320_SSH',
'LLC4320_SSU',
'LLC4320_SSV',
'CESM_POP_hires_control',
'CESM_POP_hires_RCP8_5',
'GFDL_CM2_6_control_ocean_surface',
'GFDL_CM2_6_control_ocean_3D',
'GFDL_CM2_6_one_percent_ocean_surface',
'GFDL_CM2_6_one_percent_ocean_3D',
'GFDL_CM2_6_grid']
However, when I go to load any of the datasets, I get the following error:
ds = cat.ECCOv4r3.to_dask()
ds
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-12-99add3ca0ed7> in <module>
----> 1 ds = cat.ECCOv4r3.to_dask()
2 ds
~/miniconda3/envs/pangeo/lib/python3.7/site-packages/intake_xarray/base.py in to_dask(self)
68 def to_dask(self):
69 """Return xarray object where variables are dask arrays"""
---> 70 return self.read_chunked()
71
72 def close(self):
~/miniconda3/envs/pangeo/lib/python3.7/site-packages/intake_xarray/base.py in read_chunked(self)
43 def read_chunked(self):
44 """Return xarray object (which will have chunks)"""
---> 45 self._load_metadata()
46 return self._ds
47
~/miniconda3/envs/pangeo/lib/python3.7/site-packages/intake/source/base.py in _load_metadata(self)
115 """load metadata only if needed"""
116 if self._schema is None:
--> 117 self._schema = self._get_schema()
118 self.datashape = self._schema.datashape
119 self.dtype = self._schema.dtype
~/miniconda3/envs/pangeo/lib/python3.7/site-packages/intake_xarray/base.py in _get_schema(self)
17
18 if self._ds is None:
---> 19 self._open_dataset()
20
21 metadata = {
~/miniconda3/envs/pangeo/lib/python3.7/site-packages/intake_xarray/xzarr.py in _open_dataset(self)
30 update_storage_options(options, self.storage_options)
31
---> 32 self._fs, _ = get_fs(protocol, options)
33 if protocol != 'file':
34 self._mapper = get_mapper(protocol, self._fs, urlpath)
~/miniconda3/envs/pangeo/lib/python3.7/site-packages/dask/bytes/core.py in get_fs(protocol, storage_options)
569 " pip install gcsfs",
570 )
--> 571 cls = _filesystems[protocol]
572
573 elif protocol in ["adl", "adlfs"]:
KeyError: 'gcs'
It appears to be an error related to Google Cloud Storage. For reference, I am using the base Pangeo environment provided here: https://github.com/pangeo-data/pangeo-stacks/blob/master/pangeo-notebook/binder/environment.yml
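For anyone debugging something similar, a minimal check of the versions involved, assuming only that dask, gcsfs, and intake are importable in the environment:

import dask
import gcsfs
import intake

# Print the installed versions; the KeyError above points at how dask and
# gcsfs register the 'gcs' protocol with each other, which is version-dependent.
print('dask:', dask.__version__)
print('gcsfs:', gcsfs.__version__)
print('intake:', intake.__version__)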
ping @martindurant on this one.
gcsfs has been updated to use fsspec and released, but dask has not been updated to match yet. Please either install dask from master or use an older version of gcsfs.
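If you take the older-gcsfs route, a minimal sketch for confirming the protocol is registered again, using the same private registry the traceback above goes through; the registration-on-import behaviour is an assumption about the pre-fsspec gcsfs, and whether dask.bytes.core still exposes _filesystems depends on your dask version:

import gcsfs  # importing a pre-fsspec gcsfs should register 'gcs' with dask
from dask.bytes.core import _filesystems

# The traceback above resolves the 'gcs' protocol through this private registry.
# If 'gcs' appears among the keys, to_dask() should no longer raise the KeyError.
print(sorted(_filesystems))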
Thank you @martindurant, that seemed to do the trick for the KeyError; however, there is a different issue now. When loading the dataset, it eventually gives me this error (after running for some time):
ds = cat.SOSE.to_dask()
ds
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
~/miniconda3/envs/pangeo/lib/python3.7/site-packages/urllib3/connection.py in _new_conn(self)
159 conn = connection.create_connection(
--> 160 (self._dns_host, self.port), self.timeout, **extra_kw)
161
~/miniconda3/envs/pangeo/lib/python3.7/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
79 if err is not None:
---> 80 raise err
81
~/miniconda3/envs/pangeo/lib/python3.7/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
69 sock.bind(source_address)
---> 70 sock.connect(sa)
71 return sock
OSError: [Errno 101] Network is unreachable
I should mention that my Jupyter notebook was launched from an HPC system (Cheyenne).
That sounds like something happening at a much deeper level than the one Intake works at.
You could debug and walk up the stack to find exactly which call is causing this. You may need to specify an auth mechanism to gcsfs via the storage_options parameter; it may also be worth importing gcsfs directly and seeing whether you can connect to the target data at all.
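A minimal sketch of that kind of direct test, assuming anonymous access is enough for the bucket in question; the bucket name below is a hypothetical placeholder, so substitute the one from the urlpath in catalog.yaml:

import gcsfs

# Talk to GCS directly, bypassing intake and dask, to see whether the failure
# is at the network/auth layer. token='anon' requests anonymous (public) access.
fs = gcsfs.GCSFileSystem(token='anon')

# 'pangeo-data' is a hypothetical bucket name used for illustration only.
print(fs.ls('pangeo-data'))

If this also hangs or raises the same OSError, the problem is in the network path to GCS rather than in intake or dask.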
@cpatrizio88 - the Cheyenne compute nodes do not have outside network access, so they won't be able to access datasets stored on GCS.
@jhamman ah, that's what I suspected... that is unfortunate! Well, thanks anyway. I will have to get the data transferred to the GLADE file systems then.
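For completeness, once a dataset has been copied over, a minimal sketch for opening the local copy, assuming it was transferred as a Zarr directory store; the path below is a hypothetical example:

import xarray as xr

# Open the Zarr store from the local GLADE file system; no outside network
# access is needed. Point the path at wherever the store was copied.
ds = xr.open_zarr('/glade/scratch/username/SOSE.zarr')
print(ds)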