Unidata/cftime

Switch some attributes over to dynamic calculation?

bradyrx opened this issue · 7 comments

I am moving a discussion from pangeo over to here: pangeo-data/pangeo#764. xarray has a method on their CFTimeIndex to shift the cftime.

E.g.,

import xarray as xr
times = xr.cftime_range('1950', '1952', freq='AS')
print(times)
>>> CFTimeIndex([1950-01-01 00:00:00, 1951-01-01 00:00:00, 1952-01-01 00:00:00], dtype='object')
print(times.shift(1, 'YS'))
>>> CFTimeIndex([1951-01-01 00:00:00, 1952-01-01 00:00:00, 1953-01-01 00:00:00], dtype='object')

I am using this heavily in my package to shift large CFTimeIndex's many times in succession. It's running fairly slow and we're finding that the bottleneck is most likely in cftime rather than xarray.CFTimeIndex.shift(). Please see @spencerkclark's timing comments toward the bottom of that issue discussion.

It looks like the calculation of dayofwk and dayofyr is the bottleneck here, since it's being computed on initialization of the object (

cftime/cftime/_cftime.pyx

Lines 1468 to 1473 in 9f7c576

if self.dayofwk < 0:
jd = JulianDayFromDate(self,calendar='365_day')
year,month,day,hour,mn,sec,ms,dayofwk,dayofyr =\
DateFromJulianDay(jd,return_tuple=True,calendar='365_day')
self.dayofwk = dayofwk
self.dayofyr = dayofyr
). @spencerkclark showed that there are huge speedups from commenting that out.

Would it be feasible to make these dynamic attributes? E.g. they are calculated as a property of the object only when requested by the user?

Details:

cftime version: 1.0.4.2

environment: Mac OSX, python 3.7.3

Can you try PR #160 (branch issue158) and see if speeds things up as you would expect?

This looks awesome, @jswhit! Thanks for the quick turnaround. @spencerkclark I think this nails the problem. Thoughts? We get ~200x speedup on cftime.replace, 400x speedup on cftime +datetime.timedelta, and 20x speedup on xr.CFTimeIndex.shift().

Current setup:

import xarray as xr
import cftime
import datetime
print(xr.__version__)
>>> 0.15.1
print(cftime.__version__)
>>> 1.1.1.1

date = cftime.DatetimeNoLeap(2000, 1, 1)
%timeit date.replace(year=2001)
>>> 222 µs ± 44.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

timedelta = datetime.timedelta(days=9)
%timeit date + timedelta
>>> 207 µs ± 36.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

times = xr.cftime_range('1950', '2000', freq='AS')
%timeit times.shift(1, 'YS')
>>> 14.3 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

With #160:

import xarray as xr
import cftime
import datetime
print(xr.__version__)
>>> 0.15.1
print(cftime.__version__)
>>> 1.1.1.1

date = cftime.DatetimeNoLeap(2000, 1, 1)
%timeit date.replace(year=2001)
>>> 1.15 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

timedelta = datetime.timedelta(days=9)
%timeit date + timedelta
>>> 512 ns ± 41.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

times = xr.cftime_range('1950', '2000', freq='AS')
%timeit times.shift(1, 'YS')
>>> 796 µs ± 58.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

replace

timedelta

shift

Indeed this is excellent; clever to cache the results of these in #160 as well! Thanks @jswhit for implementing things and @bradyrx for starting the discussion. I'm glad we were able to work this out together.

Merged now - will be in the 1.1.2 release. Let me know if you need a release sooner rather than later. Does the xarray test suite still pass with this change?

Good question; one failing test does appear, which seems related to the changes:

___________________________________________ test_dayofyear_after_cftime_range[D] ____________________________________________

freq = 'D'

    @pytest.mark.parametrize("freq", ["A", "M", "D"])
    def test_dayofyear_after_cftime_range(freq):
        pytest.importorskip("cftime", minversion="1.0.2.1")
        result = cftime_range("2000-02-01", periods=3, freq=freq).dayofyear
        expected = pd.date_range("2000-02-01", periods=3, freq=freq).dayofyear
>       np.testing.assert_array_equal(result, expected)
E       AssertionError:
E       Arrays are not equal
E
E       Mismatch: 66.7%
E       Max absolute difference: 33
E       Max relative difference: 0.97058824
E        x: array([32,  1,  1])
E        y: array([32, 33, 34])

tests/test_cftime_offsets.py:1184: AssertionError

A minimal example of the issue would be:

In [1]: import cftime; import datetime

In [2]: date = cftime.DatetimeGregorian(2000, 2, 1)

In [3]: date.dayofyr
Out[3]: 32

In [4]: date_after_timedelta_addition = date + datetime.timedelta(days=1)

In [5]: date_after_timedelta_addition.dayofyr
Out[5]: 1

We would expect [5] to return 33.

#163 provides a fix for the issue above.

Merged #163