Move general functionality upstream
andersy005 opened this issue · 9 comments
While working on #156 and per discussion with @dcherian, I learned that esmlab's codebase is predominantly made up of hacks.
I don't know the code that well at all (!) but this stuff looks sane.
It's been a while since I looked at esmlab's code base, and I have started having a hard time understanding it. So I spent the last four days going through the existing code, and my take-away is that esmlab's codebase is full of hacks. This makes it difficult to debug or even maintain :(
I am personally in favor of helping push most of the general functionality into xarray and keeping only the things that are domain-specific. I just found out that there has been recent progress in pydata/xarray#2922, and I am excited about this. Once pydata/xarray#2922 is merged, we can move most if not all existing functionality from https://github.com/NCAR/esmlab/blob/master/esmlab/statistics.py into xarray.
Originally posted by @andersy005 in #156 (comment)
Yeah +1 on moving things upstream.
From a quick look, it seems like esmlab basically adds time-aware weighted groupby/resampling; time bounds consistency; and time units/calendar propagation (pydata/xarray#1614).
For the last one, it would be good to document how much of the table in pydata/xarray#1614 is implemented and what's left.
Originally posted by @dcherian in #156 (comment)
From a quick look, it seems like esmlab basically adds time-aware weighted groupby/resampling; time bounds consistency; and time units/calendar propagation
This is correct.
We are relying on the time bounds for all time-aware operations.
- Decoding time: internally, esmlab decodes time by using the time bounds variable. xarray doesn't seem to be aware of the time-bounds variable when decoding time:
In [1]: import xarray as xr
In [2]: import numpy as np
...: ds['variable_1'] = xr.DataArray(
...: np.append(
...: np.zeros([12, 2, 2], dtype='float32'), np.ones([12, 2, 2], dtype='float32'), axis=0
...: ),
...: dims=['time', 'lat', 'lon'],
...: )
...: ds['variable_2'] = xr.DataArray(
...: np.append(
...: np.ones([12, 2, 2], dtype='float32'), np.zeros([12, 2, 2], dtype='float32'), axis=0
...: ),
...: dims=['time', 'lat', 'lon'],
...: )
...: ds.time.attrs['units'] = 'days since 0001-01-01 00:00:00'
...: ds.time.attrs['calendar'] = 'noleap'
...: ds.time.attrs['bounds'] = 'time_bound'
...: return ds.copy(True)
...:
In [4]: ds = create_dataset()
In [5]: ds
Out[5]:
<xarray.Dataset>
Dimensions: (d2: 2, lat: 2, lon: 2, time: 24)
Coordinates:
* lat (lat) int64 0 1
* lon (lon) int64 0 1
* time (time) float64 31.0 59.0 90.0 120.0 ... 638.0 669.0 699.0 730.0
* d2 (d2) int64 0 1
Data variables:
time_bound (time, d2) float64 0.0 31.0 31.0 59.0 ... 699.0 699.0 730.0
variable_1 (time, lat, lon) float32 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0
variable_2 (time, lat, lon) float32 1.0 1.0 1.0 1.0 1.0 ... 0.0 0.0 0.0 0.0
In [6]: xr.decode_cf(ds)
Out[6]:
<xarray.Dataset>
Dimensions: (d2: 2, lat: 2, lon: 2, time: 24)
Coordinates:
* lat (lat) int64 0 1
* lon (lon) int64 0 1
* time (time) object 0001-02-01 00:00:00 ... 0003-01-01 00:00:00
* d2 (d2) int64 0 1
Data variables:
time_bound (time, d2) object ...
variable_1 (time, lat, lon) float32 ...
variable_2 (time, lat, lon) float32 ...
Notice how the time axis is shifted to the right by one month.
Ideally, you would want the time to be:
In [10]: import cftime
In [14]: cftime.num2date(ds.time_bound.mean(dim='d2'), units='days since 0001-01-01 00:00:00', calendar='noleap')
Out[14]:
array([cftime.DatetimeNoLeap(0001-01-16 12:00:00),
cftime.DatetimeNoLeap(0001-02-15 00:00:00),
cftime.DatetimeNoLeap(0001-03-16 12:00:00),
cftime.DatetimeNoLeap(0001-04-16 00:00:00),
cftime.DatetimeNoLeap(0001-05-16 12:00:00),
cftime.DatetimeNoLeap(0001-06-16 00:00:00),
cftime.DatetimeNoLeap(0001-07-16 12:00:00),
cftime.DatetimeNoLeap(0001-08-16 12:00:00),
cftime.DatetimeNoLeap(0001-09-16 00:00:00),
cftime.DatetimeNoLeap(0001-10-16 12:00:00),
cftime.DatetimeNoLeap(0001-11-16 00:00:00),
cftime.DatetimeNoLeap(0001-12-16 12:00:00),
cftime.DatetimeNoLeap(0002-01-16 12:00:00),
cftime.DatetimeNoLeap(0002-02-15 00:00:00),
cftime.DatetimeNoLeap(0002-03-16 12:00:00),
cftime.DatetimeNoLeap(0002-04-16 00:00:00),
cftime.DatetimeNoLeap(0002-05-16 12:00:00),
cftime.DatetimeNoLeap(0002-06-16 00:00:00),
cftime.DatetimeNoLeap(0002-07-16 12:00:00),
cftime.DatetimeNoLeap(0002-08-16 12:00:00),
cftime.DatetimeNoLeap(0002-09-16 00:00:00),
cftime.DatetimeNoLeap(0002-10-16 12:00:00),
cftime.DatetimeNoLeap(0002-11-16 00:00:00),
cftime.DatetimeNoLeap(0002-12-16 12:00:00)], dtype=object)
Is this something that is reasonable to implement in xarray or is too domain-specific?
- Wrong results when applying ops such as resample() on the time axis: because of the way xarray decodes the time, users get wrong results from operations along the time axis. For instance, annual resampling on the above dataset yields three years instead of two:
In [19]: ds.resample(time='A').mean()
Out[19]:
<xarray.Dataset>
Dimensions: (d2: 2, lat: 2, lon: 2, time: 3)
Coordinates:
* time (time) object 0001-12-31 00:00:00 ... 0003-12-31 00:00:00
* lat (lat) int64 0 1
* d2 (d2) int64 0 1
* lon (lon) int64 0 1
Data variables:
variable_1 (time, lat, lon) float32 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0
variable_2 (time, lat, lon) float32 1.0 1.0 1.0 1.0 ... 0.0 0.0 0.0 0.0
- xarray discards the time_bound variable. As a user, I would expect xarray to keep it and generate new values for it accordingly. For instance:
In [26]: esmlab.resample(create_dataset(), freq='ann')
Out[26]:
<xarray.Dataset>
Dimensions: (d2: 2, lat: 2, lon: 2, time: 2)
Coordinates:
* lat (lat) int64 0 1
* lon (lon) int64 0 1
* time (time) float64 181.7 546.7
* d2 (d2) int64 0 1
Data variables:
time_bound (d2, time) float64 0.0 365.0 365.0 730.0
variable_1 (time, lat, lon) float64 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
variable_2 (time, lat, lon) float64 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0
Is it reasonable to have this in xarray, or is it domain-specific (i.e. should we handle it in esmlab)?
xarray discards the time_bound variable. As a user, I would expect xarray to keep it and generate new values for it accordingly.
#142 proposes a method to generate the new time-coordinate.
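For the annual case above, regenerating the bounds can be sketched with plain NumPy. This is a simplified illustration (not esmlab's actual implementation, and `annual_time_bounds` is a hypothetical name) that assumes the input covers complete 12-month years: each year's new bounds span the first month's lower bound and the last month's upper bound.

```python
import numpy as np

def annual_time_bounds(time_bound, months_per_year=12):
    """Collapse per-month (n, 2) time bounds into per-year (n // 12, 2) bounds.

    Simplified sketch: assumes complete years, lower bound in column 0,
    upper bound in column 1.
    """
    tb = np.asarray(time_bound, dtype=float)
    nyears = tb.shape[0] // months_per_year
    lower = tb[::months_per_year, 0][:nyears]                      # first month's lower bound
    upper = tb[months_per_year - 1::months_per_year, 1][:nyears]   # last month's upper bound
    return np.column_stack([lower, upper])
```

Applied to the 24-month bounds from the dataset above, this yields the yearly intervals [0, 365] and [365, 730], matching the time_bound values in the esmlab output shown earlier.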
- Is this something that is reasonable to implement in xarray or is it too domain-specific?
It is inconvenient but I don't think it's wrong. The file is intentionally saved with time at the end of the month. Why wasn't it saved at the midpoint of the time bounds in the first place? In any case, this won't fly (pydata/xarray#2231 (comment)): xarray basically does not deal with relationships between variables, and that is unlikely to change. Our only exception (I think) is making sure bounds variables are encoded and decoded with the same units as the grid variable.
- As above.
- This looks like a bug to me. As a data variable, time_bound should not be getting dropped (coordinate variables do get dropped, pydata/xarray#2944, but I think that should change). Your code is incomplete (missing def create_dataset...). Can you update that? I'll take a look.
Some of this might get solved eventually (pydata/xarray#1475) when xarray adds more fancy indexes but that's a longer term project.
I should be clear: esmlab adds very useful capabilities and a package like it is definitely required. It's just that general things like machinery for weighted operations should be moved upstream.
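As a concrete example of the "general machinery" meant here, a weighted mean reduces to a few lines. This is a simplified stand-in for what pydata/xarray#2922 implements (the function name is hypothetical); the real implementation additionally handles NaNs, broadcasting, and zero-weight edge cases.

```python
import numpy as np

def weighted_mean(values, weights, axis=0):
    """Weighted mean along an axis: sum(v * w) / sum(w).

    Simplified sketch of the computation behind pydata/xarray#2922;
    no NaN handling or zero-weight guards.
    """
    v = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    return (v * w).sum(axis=axis) / w.sum(axis=axis)
```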
Your code is incomplete (missing def create_dataset...). Can you update that? I'll take a look.
My bad. I partially pasted the code. Here's the complete version:
import xarray as xr
import numpy as np
def create_dataset():
    start_date = np.array([0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334], dtype=np.float64)
    start_date = np.append(start_date, start_date + 365)
    end_date = np.array([31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334, 365], dtype=np.float64)
    end_date = np.append(end_date, end_date + 365)
    ds = xr.Dataset()
    ds['time'] = xr.DataArray(end_date, dims='time')
    ds['lat'] = xr.DataArray([0, 1], dims='lat')
    ds['lon'] = xr.DataArray([0, 1], dims='lon')
    ds['d2'] = xr.DataArray([0, 1], dims='d2')
    ds['time_bound'] = xr.DataArray(
        np.array([start_date, end_date]).transpose(), dims=['time', 'd2']
    )
    ds['variable_1'] = xr.DataArray(
        np.append(
            np.zeros([12, 2, 2], dtype='float32'), np.ones([12, 2, 2], dtype='float32'), axis=0
        ),
        dims=['time', 'lat', 'lon'],
    )
    ds['variable_2'] = xr.DataArray(
        np.append(
            np.ones([12, 2, 2], dtype='float32'), np.zeros([12, 2, 2], dtype='float32'), axis=0
        ),
        dims=['time', 'lat', 'lon'],
    )
    # xr.decode_cf reads CF metadata from .attrs (not .encoding),
    # matching the partial snippet above
    ds.time.attrs['units'] = 'days since 0001-01-01 00:00:00'
    ds.time.attrs['calendar'] = 'noleap'
    ds.time.attrs['bounds'] = 'time_bound'
    return ds
Why wasn't it saved with the midpoint of time bounds in the first place?
I agree with you. This is a CESM issue and not a general issue, and it makes more sense to address it at the source.
Or we could just have a separate minimal package to handle the strange behaviors/inconsistencies of CESM output that are problematic when using xarray.
pydata/xarray#2922 could use some testing if someone's got the time
@dcherian, I took pydata/xarray#2922 for a test drive. From the results I am getting, it looks good to me.