pangeo-data/climpred

Logic to determine frequency of verification dataset fails when monthly data times are on different days of the month


Description of bug
For monthly-average data, it is common for time indices to fall on the middle day of each month, which varies from month to month. This breaks the logic in return_time_series_freq, which only identifies a monthly frequency if the day of the month is the same for every time step.

I encountered the issue while attempting to generate an uninitialized forecast. I think it was likewise causing silent issues in generating a persistence forecast, which was previously producing NaNs but works fine after implementing a hack (changing the time index of the verification dataset to match what's expected).

Code sample (reproducing the core logic of return_time_series_freq)

import cftime
import xarray as xr

# monthly-spaced time array with mid-month dates (the day differs between months)
times = [
    cftime.DatetimeNoLeap(1, 1, 15),
    cftime.DatetimeNoLeap(1, 2, 14),
    cftime.DatetimeNoLeap(1, 3, 15),
]
ds = xr.Dataset(coords={'time': times})

for freq in ['day', 'month', 'year']:
    # stop at the first component whose value at time 0 differs from any other time step
    if not (
        getattr(ds.isel({'time': 0})['time'].dt, freq) == getattr(ds['time'].dt, freq)
    ).all():
        print(freq)
        break

This returns a frequency of "day", which results in subsequent errors. To work around this, a user has to manipulate at least the verification dataset to have the same "day" for each month in the time index.
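
For reference, the workaround can be sketched as follows. This is a minimal illustration, not part of climpred: snap_to_month_start is a hypothetical helper name, and standard-library datetimes stand in for the cftime objects so the snippet runs without extra dependencies. It relies only on the datetime-like .replace() method, which cftime datetimes also provide.

```python
from datetime import datetime


def snap_to_month_start(times):
    """Return copies of `times` with the day set to 1 (hypothetical helper)."""
    # Works for any datetime-like object exposing .replace(), which includes
    # cftime datetimes such as DatetimeNoLeap.
    return [t.replace(day=1) for t in times]


# Stand-in for the mid-month time axis from the report.
mid_month = [datetime(2001, 1, 15), datetime(2001, 2, 14), datetime(2001, 3, 15)]
snapped = snap_to_month_start(mid_month)

# Every entry now shares day == 1, so the first component that differs is the month.
print([t.day for t in snapped])    # [1, 1, 1]
print([t.month for t in snapped])  # [1, 2, 3]

# On a Dataset with a cftime time axis, the same idea would presumably be
# (assumption, not executed here):
# ds = ds.assign_coords(time=snap_to_month_start(ds["time"].values))
```

Note that this only suits object-based (cftime or datetime) time axes; numpy datetime64 values have no .replace() method and would need a different approach.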

Would it be undesirable for the frequency of the verification dataset to be user specified in the same way as the units of the initialized dataset lead time need to be specified?
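
To make the suggestion concrete, here is a rough sketch of what an optional user-specified frequency with a more tolerant fallback might look like. Both function names and the freq argument are hypothetical and do not exist in climpred; the fallback simply compares year/month steps and ignores the day of the month, and standard-library datetimes are used in place of cftime objects.

```python
from datetime import datetime


def infer_freq_tolerant(times):
    """Guess 'year', 'month', or 'day' spacing while ignoring the day of month.

    Hypothetical helper for illustration; not climpred's implementation.
    """
    # Map each time stamp to a running month count, e.g. 2001-02 -> 2001*12 + 2.
    month_index = [t.year * 12 + t.month for t in times]
    steps = {b - a for a, b in zip(month_index, month_index[1:])}
    if steps == {12}:
        return "year"
    if steps == {1}:
        return "month"
    # Anything else (mixed steps, repeated months, too few points) falls back to 'day'.
    return "day"


def resolve_freq(times, freq=None):
    """Prefer an explicit user-supplied frequency; otherwise infer it."""
    return freq if freq is not None else infer_freq_tolerant(times)


mid_month = [datetime(2001, 1, 15), datetime(2001, 2, 14), datetime(2001, 3, 15)]
daily = [datetime(2001, 1, 30), datetime(2001, 1, 31), datetime(2001, 2, 1)]

print(resolve_freq(mid_month))             # month
print(resolve_freq(daily))                 # day
print(resolve_freq(daily, freq="month"))   # month (user override wins)
```

The fallback to 'day' is deliberately coarse; the point is only that an explicit freq would sidestep the inference entirely.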

Output of climpred.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-553.5.1.el8_10.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

climpred: 2.4.0
xarray: 2023.12.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
cftime: 1.6.3
netcdf4: None
nc_time_axis: 1.4.1
matplotlib: 3.8.2
cf_xarray: 0.9.2
xclim: 0.50.0
dask: 2024.5.0
distributed: 2024.5.0
setuptools: 69.5.1
pip: 24.0
conda: None
IPython: 8.25.0
sphinx: None

I have usually gone for "changing the time index of the verification dataset to match what's expected", i.e. fixing the time index before using climpred, mostly by moving every time stamp to the beginning of the month so that all days are 1.

Not sure how difficult a change would be to implement but feel free.