Nwm client IndexError: invalid index to scalar variable.
Closed this issue · 6 comments
Justin Hunter reported an issue when trying to retrieve a short range forecast using the nwm_client's gcp.NWMDataService. I verified that I can reproduce the issue locally.
Reproduce
pip install "hydrotools.nwm_client[gcp]"
pip list | grep nwm
hydrotools.nwm-client 5.0.1
from hydrotools.nwm_client import gcp
service = gcp.NWMDataService()
df = service.get(configuration="short_range", reference_time="20210101T01Z")
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/Caskroom/miniconda/base/envs/venv/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/usr/local/Caskroom/miniconda/base/envs/venv/lib/python3.8/concurrent/futures/process.py", line 198, in _process_chunk
return [fn(*args) for args in chunk]
File "/usr/local/Caskroom/miniconda/base/envs/venv/lib/python3.8/concurrent/futures/process.py", line 198, in <listcomp>
return [fn(*args) for args in chunk]
File "~/github/sandbox/test/venv/lib/python3.8/site-packages/hydrotools/nwm_client/gcp.py", line 274, in get_DataFrame
scale_factor = ds['streamflow'].scale_factor[0]
IndexError: invalid index to scalar variable.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "~/github/sandbox/test/venv/lib/python3.8/site-packages/hydrotools/nwm_client/gcp.py", line 429, in get
return cache.get(
File "~/github/sandbox/test/venv/lib/python3.8/site-packages/hydrotools/caches/hdf.py", line 93, in get
df = function(*args, **kwargs)
File "~/github/sandbox/test/venv/lib/python3.8/site-packages/hydrotools/nwm_client/gcp.py", line 353, in get_cycle
df = pd.concat(dataframes)
File "~/github/sandbox/test/venv/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "~/github/sandbox/test/venv/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 346, in concat
op = _Concatenator(
File "~/github/sandbox/test/venv/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 400, in __init__
objs = list(objs)
File "/usr/local/Caskroom/miniconda/base/envs/venv/lib/python3.8/concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
for element in iterable:
File "/usr/local/Caskroom/miniconda/base/envs/venv/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
yield fs.pop().result()
File "/usr/local/Caskroom/miniconda/base/envs/venv/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/usr/local/Caskroom/miniconda/base/envs/venv/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
IndexError: invalid index to scalar variable.
From the stack trace, it appears that the metadata field scale_factor for the streamflow variable in one of the NWM's channel_rt output files is not being deserialized as a collection (list, etc.) and is instead a scalar (int, float, etc.). This may have been caused by an upstream change in a dependency (xarray, h5netcdf).
I was able to resolve this issue by removing the index in the scale_factor object.
line 274 python/nwm_client/src/hydrotools/nwm_client/gcp.py
# Extract scale factor
scale_factor = ds['streamflow'].scale_factor[0]
# fixed with
scale_factor = ds['streamflow'].scale_factor
I am assuming that the metadata layout of NWM channel route link files is pretty static over time as we've not seen this issue before. I assume this is a deserialization issue propagating from, if I had to guess, xarray.
It might be best if we push a hot fix that guards and type checks the scale_factor field while we track down and figure out what is causing this and determine a long term solution.
Found the issue. It is propagating from h5netcdf. Today they pushed 0.14.0, which introduced the following change, per their changelog:
Return items from 0-dim and one-element 1-dim array attributes. Return multi-element attributes as lists. Return string attributes as Python strings decoded from their respective encoding (utf-8, ascii). By Kai Mühlbauer.
I verified that rolling the version back to 0.13.0 resolved this issue.
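To check which attribute behaviour a given environment will exhibit, the installed h5netcdf version can be compared against 0.14. The following is a minimal sketch, assuming the distribution is registered under the name "h5netcdf". Both function names are hypothetical, and the version parsing is deliberately simplistic (major and minor components only).

```python
import re
from importlib import metadata  # available in Python 3.8+


def returns_scalar_attrs(version):
    """Return True if this h5netcdf version string is 0.14 or newer."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = (int(part) for part in match.groups())
    return (major, minor) >= (0, 14)


def installed_h5netcdf_version():
    """Return the installed h5netcdf version string, or None if absent."""
    try:
        return metadata.version("h5netcdf")
    except metadata.PackageNotFoundError:
        return None


version = installed_h5netcdf_version()
if version is None:
    print("h5netcdf is not installed")
else:
    print(version, "returns scalar attributes:", returns_scalar_attrs(version))
```

A check like this is only useful for diagnosis; it does not replace a proper dependency constraint in the package metadata.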
Now as to how we should proceed. I know previously I said:
It might be best if we push a hot fix that guards and type checks the scale_factor field while we track down and figure out what is causing this and determine a long term solution.
In this case, I think it makes sense to just type check ds.streamflow.scale_factor and handle the case where a scalar is returned. I don't want to force others to comply with a version pinning of h5netcdf. Thoughts @jarq6c?
proposed solution
streamflow = ds['streamflow']
# h5netcdf <= 0.13.0 always deserializes numeric attributes to numpy arrays,
# even if there is only one item in the array.
if isinstance(streamflow.scale_factor, np.ndarray):
    scale_factor = streamflow.scale_factor[0]
# h5netcdf > 0.13.0 deserializes numeric attributes to numpy arrays only if the
# attribute holds more than one element; otherwise, a scalar numpy value is returned.
else:
    scale_factor = streamflow.scale_factor
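The same guard can also be sketched without an isinstance check, so that it does not depend on which container type a given h5netcdf release returns. This is only an illustration: the helper name unwrap_scalar_attr is hypothetical and not part of hydrotools, and it assumes the attribute is numeric and holds exactly one value.

```python
def unwrap_scalar_attr(value):
    """Return a one-element attribute as a plain float.

    Hypothetical helper (not part of hydrotools). Accepts either a bare
    scalar or a one-element sequence such as a list or 1-dim numpy array.
    """
    try:
        size = len(value)
    except TypeError:
        # Scalars (and 0-dim numpy arrays) have no length.
        return float(value)
    if size != 1:
        raise ValueError(f"expected exactly one value, got {size}")
    return float(value[0])


# Plain Python stand-ins for the two attribute layouts.
print(unwrap_scalar_attr([0.01]))  # sequence layout
print(unwrap_scalar_attr(0.01))    # scalar layout
```

Raising on multi-element input is a design choice here: a scale factor with more than one value would indicate an unexpected file layout rather than something to silently truncate.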
If the source attribute was a single scalar all along and was only returned in a list because of some conceit of h5netcdf, I'm inclined to just drop the index and leave it at that. Is there a good reason to continue supporting h5netcdf <= 0.13.0?
After talking with @jarq6c offline, we came to a solution (please correct me where necessary @jarq6c). Given that h5netcdf==0.14.0 was released on 2022-02-25, we will pin the current version of nwm_client (5.0.1) to h5netcdf <= 0.13.0 and release it as a post-release to 5.0.1. Subsequently, nwm_client==5.0.2 will be released and pin h5netcdf >= 0.14.0. 5.0.2 will include a patch that complies with h5netcdf >= 0.14.0.