crim-ca/stac-populator

Potentially incorrect representation of variables in data cube extension

dchandan opened this issue · 8 comments

Consider the example at: https://redoak.cs.toronto.edu/stac/collections/CMIP6_UofT/items/CMIP_EC-Earth-Consortium_EC-Earth3_historical_r21i1p1f1_Amon_clt_gr

The cube variables listed are:

"cube:variables": {
      "clt": {
        "type": "data",
        "unit": "%",
        "dimensions": [
          "time",
          "lat",
          "lon"
        ],
        "description": "Total Cloud Fraction"
      },
      "lat_bnds": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "lat",
          "bnds"
        ],
        "description": ""
      },
      "lon_bnds": {
        "type": "data",
        "unit": "",
        "dimensions": [
          "lon",
          "bnds"
        ],
        "description": ""
      },
      "time_bnds": {
        "type": "data",
        "unit": "days since 1850-01-01",
        "dimensions": [
          "time",
          "bnds"
        ],
        "description": ""
      }
    },

But, I see two problems:

  1. Variables like time, lat, lon are missing. I think this has to do with these lines. I don't think this is correct.
  2. Shouldn't all the bounds variables be listed as auxiliary variables (as per CF terminology) rather than data variables? #51 partly addresses this.

@huard you wrote the data cube extension helper code, what are your thoughts on this?

CF-xarray parses the file as:

Coordinates:
             CF Axes: * X: ['lon']
                      * Y: ['lat']
                      * T: ['time']
                        Z: n/a

      CF Coordinates: * longitude: ['lon']
                      * latitude: ['lat']
                      * time: ['time']
                        vertical: n/a

       Cell Measures:   area, volume: n/a

      Standard Names: * latitude: ['lat']
                      * longitude: ['lon']
                      * time: ['time']

              Bounds:   n/a

       Grid Mappings:   n/a

Data Variables:
       Cell Measures:   area, volume: n/a

      Standard Names:   cloud_area_fraction: ['clt']

              Bounds:   T: ['time_bnds']
                        X: ['lon_bnds']
                        Y: ['lat_bnds']
                        lat: ['lat_bnds']
                        latitude: ['lat_bnds']
                        lon: ['lon_bnds']
                        longitude: ['lon_bnds']
                        time: ['time_bnds']

       Grid Mappings:   n/a

You're saying CF-xarray is able to parse things correctly? I mean with regard to the bounds.

What are your thoughts on the missing variables I mentioned above? Should we remove the lines I pointed to as the likely culprit?

Yes, I got that output by running:

import cf_xarray  # noqa: F401 (importing registers the .cf accessor)
import xarray as xr

url = "https://redoak.cs.toronto.edu/twitcher/ows/proxy/thredds/dodsC/datasets/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r2i1p1f1/Amon/clt/gr/v20201215/clt_Amon_EC-Earth3_historical_r2i1p1f1_gr_185001-201412.nc"
ds = xr.open_dataset(url)
ds.cf
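
For illustration, here is a rough sketch of how the Bounds mapping reported above could be used to retag the bounds entries in "cube:variables" as auxiliary. The `cf_bounds` dict is copied by hand from the output above, `cube_variables` is a trimmed copy of the item's entry, and `retag_bounds` is a hypothetical helper, not something that exists in stac-populator.

```python
# Bounds mapping copied by hand from the cf-xarray output above.
cf_bounds = {
    "T": ["time_bnds"],
    "X": ["lon_bnds"],
    "Y": ["lat_bnds"],
    "lat": ["lat_bnds"],
    "latitude": ["lat_bnds"],
    "lon": ["lon_bnds"],
    "longitude": ["lon_bnds"],
    "time": ["time_bnds"],
}

# Trimmed copy of the "cube:variables" entry from the item in question.
cube_variables = {
    "clt": {"type": "data", "unit": "%", "dimensions": ["time", "lat", "lon"]},
    "lat_bnds": {"type": "data", "unit": "", "dimensions": ["lat", "bnds"]},
    "lon_bnds": {"type": "data", "unit": "", "dimensions": ["lon", "bnds"]},
    "time_bnds": {
        "type": "data",
        "unit": "days since 1850-01-01",
        "dimensions": ["time", "bnds"],
    },
}


def retag_bounds(cube_variables, cf_bounds):
    """Return a copy of cube_variables with bounds variables typed "auxiliary"."""
    # Flatten the mapping: the same bounds variable appears under several keys.
    bounds_names = {name for names in cf_bounds.values() for name in names}
    return {
        name: {**meta, "type": "auxiliary"} if name in bounds_names else dict(meta)
        for name, meta in cube_variables.items()
    }


retagged = retag_bounds(cube_variables, cf_bounds)
for name, meta in retagged.items():
    print(name, meta["type"])
```

The helper returns a new dict rather than mutating its input, so the original item properties stay untouched until the result is written back.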

Need to look into it more.

Took a bit of time to review this and refresh my memory.

Variables

I decided to put time, lon, lat in the dimensions attribute instead of the variables. I mean, they could be in both, but I didn't see how this would be useful from a catalogue perspective, where searching for variables is the primary use. Open to counterarguments.

Bounds

Agree with the second point. I looked at how CF-xarray does it and will prepare a small PR to port the logic here, and add a test.
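
A minimal sketch of what such a check could look like, relying only on the CF convention that a coordinate variable names its bounds variable in a `bounds` attribute. This is not CF-xarray's actual implementation; `classify_variables` and the metadata dict are hypothetical and only shaped like the clt file above.

```python
def classify_variables(variables):
    """Map each non-coordinate variable name to "data" or "auxiliary".

    `variables` maps a name to {"dims": [...], "attrs": {...}}.
    """
    # Any name referenced by a CF `bounds` attribute is a bounds variable.
    bounds_names = {
        meta["attrs"]["bounds"]
        for meta in variables.values()
        if "bounds" in meta["attrs"]
    }
    classified = {}
    for name, meta in variables.items():
        if meta["dims"] == [name]:
            # Coordinate variable (time, lat, lon): skipped here, mirroring
            # the current choice to describe these under dimensions only.
            continue
        classified[name] = "auxiliary" if name in bounds_names else "data"
    return classified


# Hypothetical metadata shaped like the clt file discussed above.
variables = {
    "time": {"dims": ["time"], "attrs": {"bounds": "time_bnds"}},
    "lat": {"dims": ["lat"], "attrs": {"bounds": "lat_bnds"}},
    "lon": {"dims": ["lon"], "attrs": {"bounds": "lon_bnds"}},
    "clt": {"dims": ["time", "lat", "lon"], "attrs": {"units": "%"}},
    "time_bnds": {"dims": ["time", "bnds"], "attrs": {}},
    "lat_bnds": {"dims": ["lat", "bnds"], "attrs": {}},
    "lon_bnds": {"dims": ["lon", "bnds"], "attrs": {}},
}

result = classify_variables(variables)
print(result)
```

Whether the coordinate variables should be skipped or also listed is the open question from the Variables point above; dropping the `continue` branch would include them.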

I think we can leave lat, lon and time as both dimensions and variables, in case a future use case needs this comprehensive description provided by the data cube extension.

But it doesn't matter to me. David, you choose.

I'll ask around to see what people think.