AI4S2S/s2spy

Resampling functions do not work for calendars that allow overlapping intervals

geek-yang opened this issue · 0 comments

Currently, resampling data with a calendar that has overlapping intervals fails. For instance, when we run:

import numpy as np
import pandas as pd
import s2spy.time
# define calendar
calendar = s2spy.time.AdventCalendar(anchor=(11, 30), freq='90d', n_targets=1)
calendar.set_max_lag(5, allow_overlap=True)
# example timeseries
time_index = pd.date_range('20171020', '20211001', freq='15d')
random_data = np.random.random(len(time_index))
example_series = pd.Series(random_data, index=time_index)
# map calendar to data
calendar = calendar.map_to_data(example_series)
# resample data
resampled_series = s2spy.time.resample(calendar, example_series)

it will trigger the following error:

---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
/home/yangliu/AI4S2S/s2spy/notebooks/tutorial_time.ipynb Cell 20 in <cell line: 1>()
----> 1 resampled_series = s2spy.time.resample(calendar, example_series)
      2 resampled_series

File ~/AI4S2S/s2spy/s2spy/_resample.py:229, in resample(mapped_calendar, input_data)
    225 #utils.check_timeseries(input_data)
    226 #utils.check_input_frequency(mapped_calendar, input_data)
    228 if isinstance(input_data, PandasData):
--> 229     resampled_data = resample_pandas(mapped_calendar, input_data)
    230 else:
    231     resampled_data = resample_xarray(mapped_calendar, input_data)

File ~/AI4S2S/s2spy/s2spy/_resample.py:96, in resample_pandas(calendar, input_data)
     93 bins = resample_bins_constructor(calendar.get_intervals())
     95 interval_index = pd.IntervalIndex(bins["interval"])
---> 96 interval_groups = interval_index.get_indexer(input_data.index)
     97 # except Exception as error:
     98 #     print(error)
     99 # finally:
    100 #     interval_groups, _ = interval_index.get_indexer_non_unique(input_data.index)
    101 interval_means = input_data.groupby(interval_groups).mean()

File ~/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py:3721, in Index.get_indexer(self, target, method, limit, tolerance)
...
-> 3721     raise InvalidIndexError(self._requires_unique_msg)
   3723 if len(target) == 0:
   3724     return np.array([], dtype=np.intp)

InvalidIndexError: cannot handle overlapping indices; use IntervalIndex.get_indexer_non_unique

I notice that we use pd.IntervalIndex.get_indexer() to assign each data point to an interval and then call groupby().mean() to perform the resampling. However, pd.IntervalIndex.get_indexer() cannot handle overlapping intervals, since it returns exactly one label per data point, while a point inside an overlap belongs to two or more intervals.
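To make the limitation concrete, here is a minimal sketch with toy numeric intervals (these are illustrative values, not the intervals an s2spy calendar produces). It shows get_indexer() working for disjoint intervals and raising for overlapping ones:

```python
import pandas as pd

# Toy intervals for illustration only; pandas closes them on the right by default.
disjoint = pd.IntervalIndex.from_tuples([(0, 5), (5, 10)])
overlapping = pd.IntervalIndex.from_tuples([(0, 10), (5, 15)])
points = pd.Index([2, 7])

# Without overlap, each point maps to exactly one interval position.
labels = disjoint.get_indexer(points)

# With overlap, the point 7 lies in both intervals, so pandas refuses
# to return a single label per point.
try:
    overlapping.get_indexer(points)
    raised = False
except pd.errors.InvalidIndexError:
    raised = True

print(labels, overlapping.is_overlapping, raised)
```

The is_overlapping property could serve as a cheap check before deciding which resampling path to take.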

The error message suggests using IntervalIndex.get_indexer_non_unique instead, but that does not work here either. The index array it returns is hard to interpret, and the documentation does not explain it well. I would like to investigate this further; it may be a chance to contribute to pandas. In the meantime, I will address this issue.
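As a rough illustration of why the non-unique indexer is awkward for groupby(), the sketch below uses toy numeric data. The indexer holds one entry per (point, matching interval) pair, plus a -1 for each unmatched point, so its length no longer equals the number of data points. The helper resample_overlapping is hypothetical (not part of s2spy or pandas) and assumes right-closed intervals; it sidesteps the indexer by building one boolean mask per interval:

```python
import pandas as pd

# Toy data for illustration: the point 7 lies in both intervals, 20 lies in none.
intervals = pd.IntervalIndex.from_tuples([(0, 10), (5, 15)])
series = pd.Series([1.0, 2.0, 3.0, 4.0], index=[2, 7, 12, 20])

# Flattened matches: more entries than data points when intervals overlap.
indexer, missing = intervals.get_indexer_non_unique(series.index)
print(len(series), len(indexer), missing)


def resample_overlapping(intervals, series):
    """Hypothetical helper: mean of the points inside each right-closed interval."""
    means = []
    for interval in intervals:
        # A point may satisfy the mask of several intervals.
        mask = (series.index > interval.left) & (series.index <= interval.right)
        means.append(series[mask].mean())
    return pd.Series(means, index=intervals)


interval_means = resample_overlapping(intervals, series)
print(interval_means)
```

Looping over intervals is slower than a single groupby() call, but it keeps the assignment of points to intervals explicit, which may be acceptable when the number of intervals is small.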