Switch some attributes over to dynamic calculation?
bradyrx opened this issue · 7 comments
I am moving a discussion from pangeo
over to here: pangeo-data/pangeo#764. xarray
has a method on their CFTimeIndex
to shift the cftime
.
E.g.,
import xarray as xr
times = xr.cftime_range('1950', '1952', freq='AS')
print(times)
>>> CFTimeIndex([1950-01-01 00:00:00, 1951-01-01 00:00:00, 1952-01-01 00:00:00], dtype='object')
print(times.shift(1, 'YS'))
>>> CFTimeIndex([1951-01-01 00:00:00, 1952-01-01 00:00:00, 1953-01-01 00:00:00], dtype='object')
I am using this heavily in my package to shift large CFTimeIndex
's many times in succession. It's running fairly slow and we're finding that the bottleneck is most likely in cftime
rather than xarray.CFTimeIndex.shift()
. Please see @spencerkclark's timing comments toward the bottom of that issue discussion.
It looks like the calculation of dayofwk
and dayofyr
is the bottleneck here, since it's being computed on initialization of the object (
Lines 1468 to 1473 in 9f7c576
Would it be feasible to make these dynamic attributes? E.g. they are calculated as a property of the object only when requested by the user?
Details:
cftime
version: 1.0.4.2
environment: Mac OSX, python 3.7.3
Can you try PR #160 (branch issue158
) and see if speeds things up as you would expect?
This looks awesome, @jswhit! Thanks for the quick turnaround. @spencerkclark I think this nails the problem. Thoughts? We get ~200x speedup on cftime.replace
, 400x speedup on cftime +datetime.timedelta
, and 20x speedup on xr.CFTimeIndex.shift()
.
Current setup:
import xarray as xr
import cftime
import datetime
print(xr.__version__)
>>> 0.15.1
print(cftime.__version__)
>>> 1.1.1.1
date = cftime.DatetimeNoLeap(2000, 1, 1)
%timeit date.replace(year=2001)
>>> 222 µs ± 44.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
timedelta = datetime.timedelta(days=9)
%timeit date + timedelta
>>> 207 µs ± 36.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
times = xr.cftime_range('1950', '2000', freq='AS')
%timeit times.shift(1, 'YS')
>>> 14.3 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
With #160:
import xarray as xr
import cftime
import datetime
print(xr.__version__)
>>> 0.15.1
print(cftime.__version__)
>>> 1.1.1.1
date = cftime.DatetimeNoLeap(2000, 1, 1)
%timeit date.replace(year=2001)
>>> 1.15 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
timedelta = datetime.timedelta(days=9)
%timeit date + timedelta
>>> 512 ns ± 41.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
times = xr.cftime_range('1950', '2000', freq='AS')
%timeit times.shift(1, 'YS')
>>> 796 µs ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Merged now - will be in the 1.1.2 release. Let me know if you need a release sooner rather than later. Does the xarray test suite still pass with this change?
Good question; one failing test does appear, which seems related to the changes:
___________________________________________ test_dayofyear_after_cftime_range[D] ____________________________________________
freq = 'D'
@pytest.mark.parametrize("freq", ["A", "M", "D"])
def test_dayofyear_after_cftime_range(freq):
pytest.importorskip("cftime", minversion="1.0.2.1")
result = cftime_range("2000-02-01", periods=3, freq=freq).dayofyear
expected = pd.date_range("2000-02-01", periods=3, freq=freq).dayofyear
> np.testing.assert_array_equal(result, expected)
E AssertionError:
E Arrays are not equal
E
E Mismatch: 66.7%
E Max absolute difference: 33
E Max relative difference: 0.97058824
E x: array([32, 1, 1])
E y: array([32, 33, 34])
tests/test_cftime_offsets.py:1184: AssertionError
A minimal example of the issue would be:
In [1]: import cftime; import datetime
In [2]: date = cftime.DatetimeGregorian(2000, 2, 1)
In [3]: date.dayofyr
Out[3]: 32
In [4]: date_after_timedelta_addition = date + datetime.timedelta(days=1)
In [5]: date_after_timedelta_addition.dayofyr
Out[5]: 1
We would expect [5] to return 33.
#163 provides a fix for the issue above.