Dimension operator generates `boundary variable with non-numeric data type` CF error
Context
Given the following example dataset:
$ ncdump -h toz.CF-1.8.nc
netcdf toz.CF-1.8 {
dimensions:
time = 12 ;
bounds2 = 2 ;
lat = 18 ;
lon = 36 ;
variables:
double time_bnds(time, bounds2) ;
double time(time) ;
time:standard_name = "time" ;
time:units = "days since 2000-1-1" ;
time:calendar = "gregorian" ;
time:bounds = "time_bnds" ;
double lat_bnds(lat, bounds2) ;
double lat(lat) ;
lat:standard_name = "latitude" ;
lat:units = "degrees_north" ;
lat:bounds = "lat_bnds" ;
double lon_bnds(lon, bounds2) ;
double lon(lon) ;
lon:standard_name = "longitude" ;
lon:units = "degrees_east" ;
lon:bounds = "lon_bnds" ;
double toz(time, lat, lon) ;
toz:standard_name = "atmosphere_mole_content_of_ozone" ;
toz:long_name = "Total Ozone Column" ;
toz:units = "mol m-2" ;
toz:cell_methods = "area: mean time: maximum (interval: 1 day)" ;
// global attributes:
:Conventions = "CF-1.8" ;
:title = "Ozone data mockup" ;
:comment = "Ozone mockup data generated for testing purposes" ;
:institution = "o3skim code" ;
:history = "File created on 2022-02-17 10:14:04.458982" ;
:source = "Random generation" ;
}
This dataset passes the CF checker without errors:
$ cfchecks -v 1.8 toz.CF-1.8.nc
CHECKING NetCDF FILE: toz.CF-1.8.nc
=====================
Using CF Checker Version 4.1.0
Checking against CF Version CF-1.8
Using Standard Name Table Version 78 (2021-09-21T11:55:06Z)
Using Area Type Table Version 10 (23 June 2020)
Using Standardized Region Name Table Version 4 (18 December 2018)
------------------
Checking variable: time_bnds
------------------
------------------
Checking variable: time
------------------
------------------
Checking variable: lat_bnds
------------------
------------------
Checking variable: lat
------------------
------------------
Checking variable: lon_bnds
------------------
------------------
Checking variable: lon
------------------
------------------
Checking variable: toz
------------------
ERRORS detected: 0
WARNINGS given: 0
INFORMATION messages: 0
Issue
Applying an operation, for example `mean`, generates output that no longer passes the check:
>>> ds = xr.load_dataset("toz.CF-1.8.nc")
>>> ds.cf.mean("latitude").to_netcdf("toz-reduced.CF-1.8.nc")
$ cfchecks -v 1.8 toz-reduced.CF-1.8.nc
CHECKING NetCDF FILE: toz-reduced.CF-1.8.nc
=====================
Using CF Checker Version 4.1.0
Checking against CF Version CF-1.8
Using Standard Name Table Version 78 (2021-09-21T11:55:06Z)
Using Area Type Table Version 10 (23 June 2020)
Using Standardized Region Name Table Version 4 (18 December 2018)
ERROR: (7.1): boundary variable with non-numeric data type
------------------
Checking variable: time_bnds
------------------
------------------
Checking variable: time
------------------
------------------
Checking variable: lat_bnds
------------------
WARN: (3): No standard_name or long_name attribute specified
INFO: (3.1): No units attribute set. Please consider adding a units attribute for completeness.
------------------
Checking variable: lon_bnds
------------------
WARN: (7.1): Boundary Variable lon_bnds should not have _FillValue attribute
------------------
Checking variable: lon
------------------
------------------
Checking variable: toz
------------------
ERRORS detected: 1
WARNINGS given: 2
INFORMATION messages: 1
Additional information
For additional information, here are the reduced netCDF headers:
$ ncdump -h toz-reduced.CF-1.8.nc
netcdf toz-reduced.CF-1.8 {
dimensions:
time = 12 ;
bounds2 = 2 ;
lon = 36 ;
variables:
int64 time_bnds(time, bounds2) ;
double time(time) ;
time:_FillValue = NaN ;
time:standard_name = "time" ;
time:bounds = "time_bnds" ;
time:units = "days since 2000-01-01" ;
time:calendar = "gregorian" ;
double lat_bnds(bounds2) ;
lat_bnds:_FillValue = NaN ;
double lon_bnds(lon, bounds2) ;
lon_bnds:_FillValue = NaN ;
double lon(lon) ;
lon:_FillValue = NaN ;
lon:standard_name = "longitude" ;
lon:units = "degrees_east" ;
lon:bounds = "lon_bnds" ;
double toz(time, lon) ;
toz:_FillValue = NaN ;
toz:standard_name = "atmosphere_mole_content_of_ozone" ;
toz:long_name = "Total Ozone Column" ;
toz:units = "mol m-2" ;
toz:cell_methods = "area: mean time: maximum (interval: 1 day)" ;
// global attributes:
:Conventions = "CF-1.8" ;
:title = "Ozone data mockup" ;
:comment = "Ozone mockup data generated for testing purposes" ;
:institution = "o3skim code" ;
:history = "File created on 2022-02-17 10:14:04.458982" ;
:source = "Random generation" ;
}
And here is the cf_xarray version information:
$ pip show cf_xarray
Name: cf-xarray
Version: 0.6.1
Summary: A lightweight convenience wrapper for using CF attributes on xarray objects.
Home-page: https://cf-xarray.readthedocs.io
Author: None
Author-email: None
License: Apache
Location: /home/borja/miniconda3/envs/o3as/lib/python3.8/site-packages
Requires: pandas, numpy, setuptools
Required-by:
Which variable is it talking about? That CDL only shows numeric types IIUC
It took me a while to notice it: `time_bnds`, see:
$ ncdump -h toz.CF-1.8.nc
...
variables:
double time_bnds(time, bounds2) ;
...
vs
$ ncdump -h toz-reduced.CF-1.8.nc
...
variables:
int64 time_bnds(time, bounds2) ;
...
After the operation, the data type of `time_bnds` changes from double to int64, which apparently is not accepted as a valid data type.
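Since the change was easy to miss by eye, here is a minimal sketch of how the two headers could be compared automatically. It uses only the Python standard library and parses the `ncdump -h` text directly. The helper names `variable_types` and `changed_types` are made up for this illustration, and the regular expression only handles declarations with a dimension list, so scalar variables are ignored.

```python
import re

# Matches CDL variable declarations such as "double time_bnds(time, bounds2) ;".
# Attribute lines ("time:units = ...") and dimension lines ("time = 12 ;") do not match.
DECLARATION = re.compile(r"^\s*(\w+)\s+(\w+)\s*\(([^)]*)\)\s*;")


def variable_types(cdl: str) -> dict:
    """Map each variable name in a CDL header to its declared data type."""
    types = {}
    for line in cdl.splitlines():
        match = DECLARATION.match(line)
        if match:
            dtype, name, _dims = match.groups()
            types[name] = dtype
    return types


def changed_types(before: str, after: str) -> dict:
    """Return {name: (old_type, new_type)} for variables whose type differs."""
    old, new = variable_types(before), variable_types(after)
    return {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }


# Abbreviated headers copied from the two ncdump outputs in this issue.
ORIGINAL = """
variables:
    double time_bnds(time, bounds2) ;
    double time(time) ;
    time:units = "days since 2000-1-1" ;
    double lat_bnds(lat, bounds2) ;
    double lon_bnds(lon, bounds2) ;
"""

REDUCED = """
variables:
    int64 time_bnds(time, bounds2) ;
    double time(time) ;
    time:units = "days since 2000-01-01" ;
    double lat_bnds(bounds2) ;
    double lon_bnds(lon, bounds2) ;
"""

result = changed_types(ORIGINAL, REDUCED)
print(result)  # {'time_bnds': ('double', 'int64')}
```

In practice the two strings would come from saving the output of `ncdump -h` for each file.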
There are 3 things to consider here:
- Why does cf_xarray change the data type? Would it not be better to maintain it, or to keep the same type as the variable it is bounding?
- Is int64 a correct data type for a `Time Coordinate bound`? I did not find anything related in the specs.
- cfchecker is not correctly evaluating `int64` variables. I have already opened an issue; I will link it.
Regarding cf_xarray, in my opinion it should not change the data type; the provider of the original data should have been careful enough not to provide bounds with a different data type. However, even if the type does change, the result might still be correct (but confusing as a default).
This seems to be xarray choosing `int` over `float` when writing datetime variables. This is usually a good idea because it avoids floating point inaccuracies.
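As a rough illustration of the kind of inaccuracy meant here, the sketch below round-trips a time offset through a float64 "days since" value and through a plain integer count. It is ordinary Python arithmetic, not xarray's actual encoder, and it assumes timestamps are held at nanosecond resolution. The offset value and the function names are arbitrary choices for the example.

```python
NS_PER_DAY = 86_400 * 10**9  # nanoseconds in one day


def roundtrip_via_float_days(offset_ns: int) -> int:
    """Encode a nanosecond offset as float64 days, then decode it again."""
    days = offset_ns / NS_PER_DAY    # the float64 value that would be stored
    return round(days * NS_PER_DAY)  # decoded back to integer nanoseconds


def roundtrip_via_int_ns(offset_ns: int) -> int:
    """An integer encoding stores the count itself, so nothing is rounded."""
    return int(offset_ns)


# 8083 days after the reference date plus one nanosecond (arbitrary value).
offset_ns = 8083 * NS_PER_DAY + 1

float_result = roundtrip_via_float_days(offset_ns)
int_result = roundtrip_via_int_ns(offset_ns)

print(float_result == offset_ns)  # False: the extra nanosecond is lost
print(int_result == offset_ns)    # True
```

For whole-day values like the bounds in this dataset, both encodings would be exact; the difference only shows up for offsets that a float64 cannot represent.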
The `cf-checker` issue does look like a bug.