podaac/l2ss-py

Unable to subset grouped dataset with coordinate variables in MMT

Opened this issue · 11 comments

Trying to invoke l2ss-py in UAT with TEMPO_NO2-PROXY_L2_V01 collection results in an error from Harmony. Looking at the logs reveals:

message: "Calling l2ss-py subset with params {'bbox': array([[-180,  180],
       [ -90,   90]]), 'variables': ['/product/vertical_column_total'], 'lat_var_names': ['/geolocation/latitude'], 'lon_var_names': ['/geolocation/longitude'], 'time_var_names': [], 'output_file': '/home/dockeruser/data/2db0715037f6004561e2a0b608d3cd053f610471abfc165189ac3aea064517a8.nc', 'file_to_subset': '/tmp/tmpskdk42p8/2db0715037f6004561e2a0b608d3cd053f610471abfc165189ac3aea064517a8.nc', 'origin_source': 'https://dwq9my7b5nvyc.cloudfront.net/asdc2-uat-protected/TEMPO/TEMPO_NO2-PROXY_L2_V01/2013.07.31/TEMPO_NO2-PROXY_L2_V01_20130731T232959Z_S015G06.nc'}"

Then a stacktrace follows:

message: "'/geolocation/latitude'",
exc_info: "Traceback (most recent call last):
  File "/home/dockeruser/.local/lib/python3.9/site-packages/xarray/core/dataset.py", line 1317, in _construct_dataarray
    variable = self._variables[name]
KeyError: '/geolocation/latitude'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dockeruser/.local/lib/python3.9/site-packages/harmony/cli.py", line 215, in _invoke
    (out_message, out_catalog) = adapter.invoke()
  File "/home/dockeruser/.local/lib/python3.9/site-packages/harmony/adapter.py", line 116, in invoke
    return (self.message, self._process_catalog_recursive(self.catalog))
  File "/home/dockeruser/.local/lib/python3.9/site-packages/harmony/adapter.py", line 181, in _process_catalog_recursive
    output_item = self.process_item(item.clone(), source)
  File "/worker/podaac/subsetter/subset_harmony.py", line 170, in process_item
    result_bbox = subset.subset(**subset_params)
  File "/worker/podaac/subsetter/subset.py", line 1230, in subset
    lat_var_names, lon_var_names, time_var_names = get_coordinate_variable_names(
  File "/worker/podaac/subsetter/subset.py", line 1114, in get_coordinate_variable_names
    time_var_names = [
  File "/worker/podaac/subsetter/subset.py", line 1116, in <listcomp>
    dataset, dataset[lat_var_name]
  File "/home/dockeruser/.local/lib/python3.9/site-packages/xarray/core/dataset.py", line 1410, in __getitem__
    return self._construct_dataarray(key)
  File "/home/dockeruser/.local/lib/python3.9/site-packages/xarray/core/dataset.py", line 1319, in _construct_dataarray
    _, name, variable = _get_virtual_variable(self._variables, name, self.dims)
  File "/home/dockeruser/.local/lib/python3.9/site-packages/xarray/core/dataset.py", line 175, in _get_virtual_variable
    raise KeyError(key)
KeyError: '/geolocation/latitude'",
application: "/home/dockeruser/.local/bin/l2ss_harmony",
workerTimestamp: "2022-11-15T16:36:45.213858Z",
workerLevel: "ERROR"

It appears that because this is a grouped dataset, it is first "flattened" so that all variables sit at the top level. However, the latitude and longitude variable names passed in from Harmony still contain paths with `/` to signify their groups, and l2ss-py uses those names to index into the dataset. Because the dataset has already been flattened, there are no groups, and the indexing fails.
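A minimal sketch of the mismatch (the `__` delimiter, the leading-delimiter convention, and the helper name are assumptions about how the flattening names variables, not the subsetter's actual code):

```python
# Hypothetical illustration of the failure mode: after flattening, group
# paths like '/geolocation/latitude' no longer exist as variable names.

GROUP_DELIM = "__"  # assumed delimiter used when flattening group paths

def flatten_var_name(path: str, delim: str = GROUP_DELIM) -> str:
    """Translate a group path ('/geolocation/latitude') into the kind of
    flattened top-level name the subsetter would produce."""
    return delim + delim.join(part for part in path.split("/") if part)

# A stand-in for a flattened dataset: every variable lives at the top level.
flattened_vars = {
    "__geolocation__latitude": [...],
    "__geolocation__longitude": [...],
}

harmony_name = "/geolocation/latitude"  # name as passed in from Harmony
assert harmony_name not in flattened_vars                # -> the KeyError above
assert flatten_var_name(harmony_name) in flattened_vars  # translated name works
```

Translating the incoming Harmony names into the flattened form (or the reverse) at the boundary would make the two naming schemes meet.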

@frankinspace, do you know which version of the l2ss subsetter was run in this case?

At the time reported, 2.1.1-rc.3 was deployed to UAT.

https://harmony.uat.earthdata.nasa.gov/versions shows which images are currently deployed (drop the .uat for ops)

I've started a branch for this bug (https://github.com/podaac/l2ss-py/tree/issues/127) which at the moment just includes the TEMPO data in the list of test data unit tests get run against. Right now test_subset.test_subset_bbox, test_subset.test_specified_variables, test_subset.test_get_time_variable_name, and test_subset.test_subset_size are all failing for this dataset.

@frankinspace, I haven't yet had a chance to look directly at the broader issue of variable names passed from harmony. But, regarding the failed test function, test_subset.test_subset_bbox() (which will hopefully shed light on that broader issue)...

Consider these lines in test_subset.test_subset_bbox(), where an error is raised during the call to subset.compute_coordinate_variable_names(out_ds):

subset.subset(
    file_to_subset=join(data_dir, test_file),
    bbox=bbox,
    output_file=join(subset_output_dir, output_file)
)

out_ds = xr.open_dataset(join(subset_output_dir, output_file),
                         decode_times=False,
                         decode_coords=False,
                         mask_and_scale=False)

lat_var_name, lon_var_name = subset.compute_coordinate_variable_names(out_ds)

It seems to me that although the variables are being flattened within the call to subset.subset(), they are returned to a 'grouped' state by the time the subset.compute_coordinate_variable_names(out_ds) call is made. And since subset.compute_coordinate_variable_names() only looks at the top-level group that is passed to it, it doesn't find latitude and longitude, which are inside the geolocation group.

Some options are:

  • we could either change the compute_coordinate_variable_names() function to search a group hierarchy,
  • flatten the dataset again in this test function before calling compute_coordinate_variable_names()
  • explicitly grab the geolocation group, and pass only that to the compute_coordinate_variable_names() function

Thoughts? Does this sound reasonable to you?

@danielfromearth yes I agree the compute_coordinate_variable_names function expects the dataset passed into it to already be 'flattened'.

I'm a bit torn on how to handle this one; the test is just using compute_coordinate_variable_names as a convenience for determining the coordinate variables so it can verify the output of the subset call. It's not the focus of that test.

At the same time, it would be nice if compute_coordinate_variable_names could work on grouped datasets out of the box. But if we were to go that route, I would want to make sure both subset and compute_coordinate_variable_names flatten the dataset in exactly the same way, which I think would require some moderate refactoring to pull the flattening out into its own function/module (I think the h5 and netcdf flattening got separated at some point).

@frankinspace,

The more I look at this, the more I think refactoring might not be the way to go. The compute_coordinate_variable_names function is expecting an already-flattened dataset. And if we allow compute_coordinate_variable_names to do the flattening, then we end up in a situation where either:

  1. the overall subset procedure does flattening twice (once at the very beginning of subset and then once again when computing variables) no matter what; or
  2. we have to include a check on whether flattening is necessary inside the compute_coordinate_variable_names function, and then do flattening if necessary.

Either way, if compute_coordinate_variable_names does flattening, this will probably require opening the original file (again) with NetCDF4 in append mode (since that's necessary for the group "walk") as well as using xarray, which is what compute_coordinate_variable_names uses now.

A third option would be to avoid using compute_coordinate_variable_names outside of the overall subset function at all, i.e., avoid the subset-then-verify approach used in the test_subset.test_subset_bbox() function.


Thoughts? For instance, do you like the idea of checking for a group structure inside the compute_coordinate_variable_names function?

Yes, perhaps compute_coordinate_variable_names should be an internal-only function. The reason we started using it outside of the module itself is that Jupyter notebooks and examples are more consistent if they can use the exact same logic for determining coordinate variables that the subsetter itself uses.

Okay, I've figured out a workaround of sorts. Instead of refactoring compute_coordinate_variable_names(), I pulled the netCDF opening and group flattening out into its own function and call that in the test suites just before compute_coordinate_variable_names() is called.

You'll see that change in PR #130. Thoughts?

(please pardon the quick off-topic question: is there a plan to add type hints throughout the code base? Is that something that would be worthwhile for another issue and PR?)

Yes I'm all for adding type hints

Okay, there's now a new issue (#136) for following up on type hints.