Potentially incorrect representation of variables in data cube extension
dchandan opened this issue · 8 comments
Consider the example at: https://redoak.cs.toronto.edu/stac/collections/CMIP6_UofT/items/CMIP_EC-Earth-Consortium_EC-Earth3_historical_r21i1p1f1_Amon_clt_gr
The cube variables listed are:
"cube:variables": {
  "clt": {
    "type": "data",
    "unit": "%",
    "dimensions": [
      "time",
      "lat",
      "lon"
    ],
    "description": "Total Cloud Fraction"
  },
  "lat_bnds": {
    "type": "data",
    "unit": "",
    "dimensions": [
      "lat",
      "bnds"
    ],
    "description": ""
  },
  "lon_bnds": {
    "type": "data",
    "unit": "",
    "dimensions": [
      "lon",
      "bnds"
    ],
    "description": ""
  },
  "time_bnds": {
    "type": "data",
    "unit": "days since 1850-01-01",
    "dimensions": [
      "time",
      "bnds"
    ],
    "description": ""
  }
},
But, I see two problems:
- Variables like time, lat, lon are missing. I think this has to do with these lines. I don't think this is correct.
- Shouldn't all the bounds variables be listed as auxiliary variables (as per CF terminology) rather than data variables? #51 partly addresses this.
@huard you wrote the data cube extension helper code, what are your thoughts on this?
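The first problem above can be checked mechanically. Here is a minimal sketch, assuming only the "cube:variables" mapping shown earlier (trimmed to the keys the check needs); the function name `dimensions_without_variable` is hypothetical and not part of any STAC tooling:

```python
def dimensions_without_variable(cube_variables):
    """Return dimension names used by variables but not listed as variables."""
    # Collect every dimension name referenced by any variable.
    used = {dim for var in cube_variables.values() for dim in var["dimensions"]}
    # Keep only the names that have no entry of their own.
    return sorted(used - set(cube_variables))


# The "cube:variables" content from the item above, reduced to type and dimensions.
cube_variables = {
    "clt": {"type": "data", "dimensions": ["time", "lat", "lon"]},
    "lat_bnds": {"type": "data", "dimensions": ["lat", "bnds"]},
    "lon_bnds": {"type": "data", "dimensions": ["lon", "bnds"]},
    "time_bnds": {"type": "data", "dimensions": ["time", "bnds"]},
}

missing = dimensions_without_variable(cube_variables)
print(missing)
```

Note that `bnds` would be reported alongside `time`, `lat` and `lon`; it does not appear among the coordinates in the file, so only the other three are the ones in question here.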
CF-xarray parses the file as:
Coordinates:
             CF Axes: * X: ['lon']
                      * Y: ['lat']
                      * T: ['time']
                        Z: n/a
      CF Coordinates: * longitude: ['lon']
                      * latitude: ['lat']
                      * time: ['time']
                        vertical: n/a
       Cell Measures:   area, volume: n/a
      Standard Names: * latitude: ['lat']
                      * longitude: ['lon']
                      * time: ['time']
              Bounds:   n/a
       Grid Mappings:   n/a
Data Variables:
       Cell Measures:   area, volume: n/a
      Standard Names:   cloud_area_fraction: ['clt']
              Bounds:   T: ['time_bnds']
                        X: ['lon_bnds']
                        Y: ['lat_bnds']
                        lat: ['lat_bnds']
                        latitude: ['lat_bnds']
                        lon: ['lon_bnds']
                        longitude: ['lon_bnds']
                        time: ['time_bnds']
       Grid Mappings:   n/a
You're saying CF-xarray is able to parse things correctly? I mean with regard to the bounds.
What are your thoughts on the missing variables I mentioned above? Should we remove the lines I pointed to as the likely culprit?
Yes, I got that output by doing:
import cf_xarray  # importing registers the .cf accessor on xarray objects
import xarray as xr
l = "https://redoak.cs.toronto.edu/twitcher/ows/proxy/thredds/dodsC/datasets/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r2i1p1f1/Amon/clt/gr/v20201215/clt_Amon_EC-Earth3_historical_r2i1p1f1_gr_185001-201412.nc"
ds = xr.open_dataset(l)  # opens the remote file over OPeNDAP
ds.cf
Need to look into it more.
Took a bit of time to review this and refresh my memory.
Variables
I decided to put time, lon, lat in the dimensions attribute instead of the variables. They could be in both, but I didn't see how this would be useful from a catalogue perspective, where searching for variables is the primary usage. Open to counterarguments.
Bounds
Agree with the second point. I looked at how CF-xarray does it and will prepare a small PR to port the logic here, and add a test.
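As a rough illustration of what such logic could look like (a sketch only, not the actual PR and not cf-xarray's implementation): CF links a coordinate to its bounds through the coordinate's `bounds` attribute, so any variable named by such an attribute can be typed "auxiliary" instead of "data". The function name `classify_variables` and the shape of the `attrs` input are hypothetical, and the attribute values below are assumed from the cf-xarray output above rather than read from the file.

```python
def classify_variables(attrs_by_name, var_names):
    """Map each name in var_names to "auxiliary" (a bounds variable) or "data".

    attrs_by_name: attributes of every variable and coordinate in the file.
    var_names: names of the variables to classify.
    """
    # Names referenced by a CF "bounds" attribute anywhere in the file.
    bounds_names = {
        attrs["bounds"] for attrs in attrs_by_name.values() if "bounds" in attrs
    }
    return {
        name: "auxiliary" if name in bounds_names else "data"
        for name in var_names
    }


# Assumed attributes for the file discussed in this issue (hypothetical input).
attrs = {
    "time": {"bounds": "time_bnds"},
    "lat": {"bounds": "lat_bnds"},
    "lon": {"bounds": "lon_bnds"},
    "clt": {"standard_name": "cloud_area_fraction"},
    "time_bnds": {},
    "lat_bnds": {},
    "lon_bnds": {},
}

types = classify_variables(attrs, ["clt", "lat_bnds", "lon_bnds", "time_bnds"])
print(types)
```

With these inputs only `clt` stays a data variable; the three `*_bnds` variables become auxiliary, which matches the second point raised in the issue.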
I think we can leave lat, lon and time as both dimensions and variables, in case a future use case needs this comprehensive description provided by the data cube extension.
But doesn't matter to me. David, you choose.
I'll ask around to see what people think.
For reference, the Microsoft collection lists lat, lon, time as dimensions only.
https://planetarycomputer-staging.microsoft.com/api/stac/v1/collections/nasa-nex-gddp-cmip6/items/UKESM1-0-LL.ssp585.2100