xarray-contrib/cf-xarray

Dimension operator adds undesired `_FillValue attribute` to boundary variables

Closed this issue · 2 comments

Context

Given the following example dataset:

$ ncdump -h toz.CF-1.8.nc 
netcdf toz.CF-1.8 {
dimensions:
        time = 12 ;
        bounds2 = 2 ;
        lat = 18 ;
        lon = 36 ;
variables:
        double time_bnds(time, bounds2) ;
        double time(time) ;
                time:standard_name = "time" ;
                time:units = "days since 2000-1-1" ;
                time:calendar = "gregorian" ;
                time:bounds = "time_bnds" ;
        double lat_bnds(lat, bounds2) ;
        double lat(lat) ;
                lat:standard_name = "latitude" ;
                lat:units = "degrees_north" ;
                lat:bounds = "lat_bnds" ;
        double lon_bnds(lon, bounds2) ;
        double lon(lon) ;
                lon:standard_name = "longitude" ;
                lon:units = "degrees_east" ;
                lon:bounds = "lon_bnds" ;
        double toz(time, lat, lon) ;
                toz:standard_name = "atmosphere_mole_content_of_ozone" ;
                toz:long_name = "Total Ozone Column" ;
                toz:units = "mol m-2" ;
                toz:cell_methods = "area: mean time: maximum (interval: 1 day)" ;

// global attributes:
                :Conventions = "CF-1.8" ;
                :title = "Ozone data mockup" ;
                :comment = "Ozone mockup data generated for testing purposes" ;
                :institution = "o3skim code" ;
                :history = "File created on 2022-02-17 10:14:04.458982" ;
                :source = "Random generation" ;
}

Which is correct:

$ cfchecks -v 1.8 toz.CF-1.8.nc 
CHECKING NetCDF FILE: toz.CF-1.8.nc
=====================
Using CF Checker Version 4.1.0
Checking against CF Version CF-1.8
Using Standard Name Table Version 78 (2021-09-21T11:55:06Z)
Using Area Type Table Version 10 (23 June 2020)
Using Standardized Region Name Table Version 4 (18 December 2018)


------------------
Checking variable: time_bnds
------------------

------------------
Checking variable: time
------------------

------------------
Checking variable: lat_bnds
------------------

------------------
Checking variable: lat
------------------

------------------
Checking variable: lon_bnds
------------------

------------------
Checking variable: lon
------------------

------------------
Checking variable: toz
------------------

ERRORS detected: 0
WARNINGS given: 0
INFORMATION messages: 0

Issue

Generates a wrong output after applying an operation, for example mean:

>>> ds = xr.load_dataset("toz.CF-1.8.nc")
>>> ds.cf.mean("latitude").to_netcdf("toz-reduced.CF-1.8.nc")
$ cfchecks -v 1.8 toz-reduced.CF-1.8.nc 
CHECKING NetCDF FILE: toz-reduced.CF-1.8.nc
=====================
Using CF Checker Version 4.1.0
Checking against CF Version CF-1.8
Using Standard Name Table Version 78 (2021-09-21T11:55:06Z)
Using Area Type Table Version 10 (23 June 2020)
Using Standardized Region Name Table Version 4 (18 December 2018)

ERROR: (7.1): boundary variable with non-numeric data type

------------------
Checking variable: time_bnds
------------------

------------------
Checking variable: time
------------------

------------------
Checking variable: lat_bnds
------------------
WARN: (3): No standard_name or long_name attribute specified
INFO: (3.1): No units attribute set.  Please consider adding a units attribute for completeness.

------------------
Checking variable: lon_bnds
------------------
WARN: (7.1): Boundary Variable lon_bnds should not have _FillValue attribute

------------------
Checking variable: lon
------------------

------------------
Checking variable: toz
------------------

ERRORS detected: 1
WARNINGS given: 2
INFORMATION messages: 1

Aditional information

For additional information, here are the reduced netCDF headers:

$ ncdump -h toz-reduced.CF-1.8.nc
netcdf toz-reduced.CF-1.8 {
dimensions:
        time = 12 ;
        bounds2 = 2 ;
        lon = 36 ;
variables:
        int64 time_bnds(time, bounds2) ;
        double time(time) ;
                time:_FillValue = NaN ;
                time:standard_name = "time" ;
                time:bounds = "time_bnds" ;
                time:units = "days since 2000-01-01" ;
                time:calendar = "gregorian" ;
        double lat_bnds(bounds2) ;
                lat_bnds:_FillValue = NaN ;
        double lon_bnds(lon, bounds2) ;
                lon_bnds:_FillValue = NaN ;
        double lon(lon) ;
                lon:_FillValue = NaN ;
                lon:standard_name = "longitude" ;
                lon:units = "degrees_east" ;
                lon:bounds = "lon_bnds" ;
        double toz(time, lon) ;
                toz:_FillValue = NaN ;
                toz:standard_name = "atmosphere_mole_content_of_ozone" ;
                toz:long_name = "Total Ozone Column" ;
                toz:units = "mol m-2" ;
                toz:cell_methods = "area: mean time: maximum (interval: 1 day)" ;

// global attributes:
                :Conventions = "CF-1.8" ;
                :title = "Ozone data mockup" ;
                :comment = "Ozone mockup data generated for testing purposes" ;
                :institution = "o3skim code" ;
                :history = "File created on 2022-02-17 10:14:04.458982" ;
                :source = "Random generation" ;
}

And here the cf_xarray version information:

$ pip show cf_xarray
Name: cf-xarray
Version: 0.6.1
Summary: A lightweight convenience wrapper for using CF attributes on xarray objects. 
Home-page: https://cf-xarray.readthedocs.io
Author: None
Author-email: None
License: Apache
Location: /home/borja/miniconda3/envs/o3as/lib/python3.8/site-packages
Requires: pandas, numpy, setuptools
Required-by: 

This is an xarray issue: pydata/xarray#2037 It would be useful if you could comment there.

I see, that is quite painful. I am posting here a temporal solution to this in case someone has the same issue.
Based on https://stackoverflow.com/questions/45693688/xarray-automatically-applying-fillvalue-to-coordinates-on-netcdf-output

for var in ds.variables:
    if '_FillValue' not in ds[var].attrs:
        ds[var].attrs['_FillValue'] = False

Also possible to run it like:

for var in variables.keys():
    if variables[var].dims == (var,):
        variables[var].attrs["_FillValue"] = None
        if "bounds" in variables[var].attrs:
            bnds = variables[var].attrs['bounds']
            variables[bnds].attrs["_FillValue"] = None