calling compute_ann_mean(ds) can lead to ds.time becoming encoded
Closed this issue · 7 comments
If I call compute_ann_mean(ds) on an xarray Dataset ds that has
an unencoded time coordinate, and a time bounds named ds.time.attrs['bounds'],
I'm seeing that the time coordinate in ds gets encoded by the compute_ann_mean call.
Changing aspects of ds in compute_ann_mean(ds) is unexpected behavior (to me).
I'm guessing that this is occurring at
https://github.com/NCAR/esmlab/blob/master/esmlab/climatology.py#L209
Below is some jupyter notebook code that demonstrates this.
I'm using the very latest commit of esmlab, be608ed, and
numpy 1.15.4
xarray 0.11.3
import numpy as np
import xarray as xr
import esmlab
# set up values for Dataset, 2 yrs of analytic monthly values
days_1yr = np.array([31.0, 28.0, 31.0, 30.0, 31.0, 30.0, 31.0, 31.0, 30.0, 31.0, 30.0, 31.0])
time_edges = np.insert(np.cumsum(np.concatenate((days_1yr, days_1yr))), 0, 0)
time_bounds_vals = np.stack((time_edges[:-1], time_edges[1:]), axis=1)
time_vals = np.mean(time_bounds_vals, axis=1)
var_vals = np.sin(np.pi * time_vals / 365.0)
# create Dataset, including time_bounds
time_var = xr.DataArray(time_vals, name='time', dims='time', coords={'time':time_vals},
attrs={'units':'days since 0001-01-01', 'calendar':'noleap',
'bounds':'time_bounds'})
time_bounds = xr.DataArray(time_bounds_vals, name='time_bounds', dims=('time', 'd2'),
coords={'time':time_var})
var = xr.DataArray(var_vals, name='var', dims='time', coords={'time':time_var})
ds = var.to_dataset()
ds = xr.merge((ds, time_bounds))
print('ds.time before compute_ann_mean')
print(ds.time)
ds_ann = esmlab.climatology.compute_ann_mean(ds)
print('')
print('ds.time after compute_ann_mean')
print(ds.time)
ds.time before compute_ann_mean
<xarray.DataArray 'time' (time: 24)>
array([ 15.5, 45. , 74.5, 105. , 135.5, 166. , 196.5, 227.5, 258. , 288.5,
319. , 349.5, 380.5, 410. , 439.5, 470. , 500.5, 531. , 561.5, 592.5,
623. , 653.5, 684. , 714.5])
Coordinates:
* time (time) float64 15.5 45.0 74.5 105.0 ... 623.0 653.5 684.0 714.5
Attributes:
units: days since 0001-01-01
calendar: noleap
bounds: time_bounds
ds.time after compute_ann_mean
<xarray.DataArray 'time' (time: 24)>
array([cftime.DatetimeNoLeap(1, 1, 16, 12, 0, 0, 0, 2, 16),
cftime.DatetimeNoLeap(1, 2, 15, 0, 0, 0, 0, 4, 46),
cftime.DatetimeNoLeap(1, 3, 16, 12, 0, 0, 0, 5, 75),
cftime.DatetimeNoLeap(1, 4, 16, 0, 0, 0, 0, 1, 106),
cftime.DatetimeNoLeap(1, 5, 16, 12, 0, 0, 0, 3, 136),
cftime.DatetimeNoLeap(1, 6, 16, 0, 0, 0, 0, 6, 167),
cftime.DatetimeNoLeap(1, 7, 16, 12, 0, 0, 0, 1, 197),
cftime.DatetimeNoLeap(1, 8, 16, 12, 0, 0, 0, 4, 228),
cftime.DatetimeNoLeap(1, 9, 16, 0, 0, 0, 0, 0, 259),
cftime.DatetimeNoLeap(1, 10, 16, 12, 0, 0, 0, 2, 289),
cftime.DatetimeNoLeap(1, 11, 16, 0, 0, 0, 0, 5, 320),
cftime.DatetimeNoLeap(1, 12, 16, 12, 0, 0, 0, 0, 350),
cftime.DatetimeNoLeap(2, 1, 16, 12, 0, 0, 0, 3, 16),
cftime.DatetimeNoLeap(2, 2, 15, 0, 0, 0, 0, 5, 46),
cftime.DatetimeNoLeap(2, 3, 16, 12, 0, 0, 0, 6, 75),
cftime.DatetimeNoLeap(2, 4, 16, 0, 0, 0, 0, 2, 106),
cftime.DatetimeNoLeap(2, 5, 16, 12, 0, 0, 0, 4, 136),
cftime.DatetimeNoLeap(2, 6, 16, 0, 0, 0, 0, 0, 167),
cftime.DatetimeNoLeap(2, 7, 16, 12, 0, 0, 0, 2, 197),
cftime.DatetimeNoLeap(2, 8, 16, 12, 0, 0, 0, 5, 228),
cftime.DatetimeNoLeap(2, 9, 16, 0, 0, 0, 0, 1, 259),
cftime.DatetimeNoLeap(2, 10, 16, 12, 0, 0, 0, 3, 289),
cftime.DatetimeNoLeap(2, 11, 16, 0, 0, 0, 0, 6, 320),
cftime.DatetimeNoLeap(2, 12, 16, 12, 0, 0, 0, 1, 350)], dtype=object)
Coordinates:
* time (time) object 0001-01-16 12:00:00 ... 0002-12-16 12:00:00
Attributes:
units: days since 0001-01-01
calendar: noleap
bounds: time_bounds
Changing aspects of ds in compute_ann_mean(ds) is unexpected behavior (to me).
We probably shouldn't alter the dataset time index without letting the user know.
For a temporary fix after computing compute_ann_mean
, you can try the following:
import esmlab
ds = esmlab.utils.time.uncompute_time_var(ds, 'time')
I will patch this issue some time today.
@klindsay28 thanks for raising this issue. I think we need to decide what the desired behavior is. At present, esmlab
is geared toward operating on datasets where time has not been decoded, but returning datasets with time decoded.
Time must be decoded internally to enable application of the groupby
methods that are the core of the data-manipulations.
I think I agree that a more intuitive behavior would be to restore the time axis to its un-decoded state by default.
Can the internal decoding of the time variable be done on a copy of the function arguments, instead of on the function arguments themselves? This would avoid the need to restore the time axis, because it was never changed it in the first place.
Following up on the suggestion of @andersy005, I took a look at the related function esmlab.utils.time.compute_time_var
. I see that in addition to returning an xr.Dataset
with an encoded time coordinate, it also encodes the time coordinate in dset
, its xr.Dataset
argument. I would not have guessed this from the function name. Maybe I would have if the function were named encode_time_var
.
@klindsay28, first, regarding the name of the function, I see your point, though we are doing more than encoding time: the function does actually compute time as the mid-point of the time_bounds.
We are using the xarray.core.groupby
functionality for much of the computation. Your suggestion of not modifying the original dataset is a good one, but as far as I know this won't work with groupby
. We would need to compute the groupby
objects using the modified time axis and then subsequently apply them, but I don't think that's supported explicitly.
One option would be to keep the original time_coord_var
on the dataset:
- rather than replacing "time", we would use xarray's
set_coords
andreset_coords
methods to make our "new" time object the time coordinate (and the old time coordinate a variable defined on that coordinate); - then apply the
groupby
using this coordinate; - then set/reset to restore the old time coordinate.
I have mixed feelings about this: it's nice to have the dataset returned with all the functionality of time-handling enabled (i.e., with time decoded). However, I recognize that the current behavior is somewhat mysterious and counterintuitive. I think carrying the old time coord along for the computation is a good idea regardless, so maybe we can have a flag (or better yet a config option) for how to return time.
It looks like the calls to groupby
in compute_ann_mean
use the argument group=time.year
. The docs for groupby
at xarray.Dataset.groupby
state that group
'must be the name of a variable contained in this dataset'. There are no stated restrictions that it needs to be a coordinate.
I wonder if compute_time_var
could add to dset
a new variable computed_time_mid
that is encoded, and this could be used in the groupby
calls.
good point!
I wonder if
compute_time_var
could add todset
a new variablecomputed_time_mid
that is encoded, and this could be used in thegroupby
calls.
This sounds like a great approach and could come in handy for other purposes too.