NCAR/esmlab

calling compute_ann_mean(ds) can lead to ds.time becoming encoded

Closed this issue · 7 comments

If I call compute_ann_mean(ds) on an xarray Dataset ds that has
an unencoded time coordinate, and a time bounds named ds.time.attrs['bounds'],
I'm seeing that the time coordinate in ds gets encoded by the compute_ann_mean call.
Changing aspects of ds in compute_ann_mean(ds) is unexpected behavior (to me).
I'm guessing that this is occurring at
https://github.com/NCAR/esmlab/blob/master/esmlab/climatology.py#L209

Below is some jupyter notebook code that demonstrates this.
I'm using the very latest commit of esmlab, be608ed, and
numpy 1.15.4
xarray 0.11.3

import numpy as np
import xarray as xr
import esmlab
# set up values for Dataset, 2 yrs of analytic monthly values
days_1yr = np.array([31.0, 28.0, 31.0, 30.0, 31.0, 30.0, 31.0, 31.0, 30.0, 31.0, 30.0, 31.0])
time_edges = np.insert(np.cumsum(np.concatenate((days_1yr, days_1yr))), 0, 0)
time_bounds_vals = np.stack((time_edges[:-1], time_edges[1:]), axis=1)
time_vals = np.mean(time_bounds_vals, axis=1)
var_vals = np.sin(np.pi * time_vals / 365.0)
# create Dataset, including time_bounds
time_var = xr.DataArray(time_vals, name='time', dims='time', coords={'time':time_vals},
                        attrs={'units':'days since 0001-01-01', 'calendar':'noleap',
                               'bounds':'time_bounds'})
time_bounds = xr.DataArray(time_bounds_vals, name='time_bounds', dims=('time', 'd2'),
                           coords={'time':time_var})
var = xr.DataArray(var_vals, name='var', dims='time', coords={'time':time_var})
ds = var.to_dataset()
ds = xr.merge((ds, time_bounds))

print('ds.time before compute_ann_mean')
print(ds.time)
ds_ann = esmlab.climatology.compute_ann_mean(ds)
print('')
print('ds.time after compute_ann_mean')
print(ds.time)
ds.time before compute_ann_mean
<xarray.DataArray 'time' (time: 24)>
array([ 15.5,  45. ,  74.5, 105. , 135.5, 166. , 196.5, 227.5, 258. , 288.5,
       319. , 349.5, 380.5, 410. , 439.5, 470. , 500.5, 531. , 561.5, 592.5,
       623. , 653.5, 684. , 714.5])
Coordinates:
  * time     (time) float64 15.5 45.0 74.5 105.0 ... 623.0 653.5 684.0 714.5
Attributes:
    units:     days since 0001-01-01
    calendar:  noleap
    bounds:    time_bounds

ds.time after compute_ann_mean
<xarray.DataArray 'time' (time: 24)>
array([cftime.DatetimeNoLeap(1, 1, 16, 12, 0, 0, 0, 2, 16),
       cftime.DatetimeNoLeap(1, 2, 15, 0, 0, 0, 0, 4, 46),
       cftime.DatetimeNoLeap(1, 3, 16, 12, 0, 0, 0, 5, 75),
       cftime.DatetimeNoLeap(1, 4, 16, 0, 0, 0, 0, 1, 106),
       cftime.DatetimeNoLeap(1, 5, 16, 12, 0, 0, 0, 3, 136),
       cftime.DatetimeNoLeap(1, 6, 16, 0, 0, 0, 0, 6, 167),
       cftime.DatetimeNoLeap(1, 7, 16, 12, 0, 0, 0, 1, 197),
       cftime.DatetimeNoLeap(1, 8, 16, 12, 0, 0, 0, 4, 228),
       cftime.DatetimeNoLeap(1, 9, 16, 0, 0, 0, 0, 0, 259),
       cftime.DatetimeNoLeap(1, 10, 16, 12, 0, 0, 0, 2, 289),
       cftime.DatetimeNoLeap(1, 11, 16, 0, 0, 0, 0, 5, 320),
       cftime.DatetimeNoLeap(1, 12, 16, 12, 0, 0, 0, 0, 350),
       cftime.DatetimeNoLeap(2, 1, 16, 12, 0, 0, 0, 3, 16),
       cftime.DatetimeNoLeap(2, 2, 15, 0, 0, 0, 0, 5, 46),
       cftime.DatetimeNoLeap(2, 3, 16, 12, 0, 0, 0, 6, 75),
       cftime.DatetimeNoLeap(2, 4, 16, 0, 0, 0, 0, 2, 106),
       cftime.DatetimeNoLeap(2, 5, 16, 12, 0, 0, 0, 4, 136),
       cftime.DatetimeNoLeap(2, 6, 16, 0, 0, 0, 0, 0, 167),
       cftime.DatetimeNoLeap(2, 7, 16, 12, 0, 0, 0, 2, 197),
       cftime.DatetimeNoLeap(2, 8, 16, 12, 0, 0, 0, 5, 228),
       cftime.DatetimeNoLeap(2, 9, 16, 0, 0, 0, 0, 1, 259),
       cftime.DatetimeNoLeap(2, 10, 16, 12, 0, 0, 0, 3, 289),
       cftime.DatetimeNoLeap(2, 11, 16, 0, 0, 0, 0, 6, 320),
       cftime.DatetimeNoLeap(2, 12, 16, 12, 0, 0, 0, 1, 350)], dtype=object)
Coordinates:
  * time     (time) object 0001-01-16 12:00:00 ... 0002-12-16 12:00:00
Attributes:
    units:     days since 0001-01-01
    calendar:  noleap
    bounds:    time_bounds

Changing aspects of ds in compute_ann_mean(ds) is unexpected behavior (to me).

We probably shouldn't alter the dataset time index without letting the user know.

For a temporary fix after computing compute_ann_mean, you can try the following:

import esmlab

ds = esmlab.utils.time.uncompute_time_var(ds, 'time')

I will patch this issue some time today.

@klindsay28 thanks for raising this issue. I think we need to decide what the desired behavior is. At present, esmlab is geared toward operating on datasets where time has not been decoded, but returning datasets with time decoded.

Time must be decoded internally to enable application of the groupby methods that are the core of the data-manipulations.

I think I agree that a more intuitive behavior would be to restore the time axis to its un-decoded state by default.

Can the internal decoding of the time variable be done on a copy of the function arguments, instead of on the function arguments themselves? This would avoid the need to restore the time axis, because it was never changed it in the first place.

Following up on the suggestion of @andersy005, I took a look at the related function esmlab.utils.time.compute_time_var. I see that in addition to returning an xr.Dataset with an encoded time coordinate, it also encodes the time coordinate in dset, its xr.Dataset argument. I would not have guessed this from the function name. Maybe I would have if the function were named encode_time_var.

@klindsay28, first, regarding the name of the function, I see your point, though we are doing more than encoding time: the function does actually compute time as the mid-point of the time_bounds.

We are using the xarray.core.groupby functionality for much of the computation. Your suggestion of not modifying the original dataset is a good one, but as far as I know this won't work with groupby. We would need to compute the groupby objects using the modified time axis and then subsequently apply them, but I don't think that's supported explicitly.

One option would be to keep the original time_coord_var on the dataset:

  1. rather than replacing "time", we would use xarray's set_coords and reset_coords methods to make our "new" time object the time coordinate (and the old time coordinate a variable defined on that coordinate);
  2. then apply the groupby using this coordinate;
  3. then set/reset to restore the old time coordinate.

I have mixed feelings about this: it's nice to have the dataset returned with all the functionality of time-handling enabled (i.e., with time decoded). However, I recognize that the current behavior is somewhat mysterious and counterintuitive. I think carrying the old time coord along for the computation is a good idea regardless, so maybe we can have a flag (or better yet a config option) for how to return time.

It looks like the calls to groupby in compute_ann_mean use the argument group=time.year. The docs for groupby at xarray.Dataset.groupby
state that group 'must be the name of a variable contained in this dataset'. There are no stated restrictions that it needs to be a coordinate.

I wonder if compute_time_var could add to dset a new variable computed_time_mid that is encoded, and this could be used in the groupby calls.

good point!

I wonder if compute_time_var could add to dset a new variable computed_time_mid that is encoded, and this could be used in the groupby calls.

This sounds like a great approach and could come in handy for other purposes too.