georgebv/pyextremes

alternative to block_size

yngwaz opened this issue · 2 comments

First: Thanks for this package, I like it a lot (and it is much faster than my own implementation)!

My problem is "block_size":
If I am interested in annual maxima, the selected blocks can easily traverse "hard boundaries" such as 31.12/01.01 (or 31.08/01.09 if I am interested in school years, for instance). It could thus happen (rarely, of course) that an annual maximum is attributed to the wrong year. This is of course related to the problem of leap years.

A worst-case scenario would probably be:
A maximum some time in e.g. 2020-12 and another "almost-maximum" at 2021-01-01 03:00, with no comparably high value for the rest of 2021. The value at 2021-01-01 03:00 could then be counted towards the annual block of 2020 (and thus not appear in the extremes at all). However, I would really prefer to have the value at 2021-01-01 03:00 counted towards 2021, where it would provide the extreme value for 2021.

The date_time_intervals are constructed from the first element of my time series (ts) and the block_size. Since I am aware of this, I can pad my ts with zeros back to the desired start of the year (not with NaNs, because pyextremes removes those before building the date_time_intervals, which would give me even stranger year periods). A rough sketch of this workaround follows below.
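Just to make the workaround concrete, this is roughly what I do (assuming ts is a pandas Series with an hourly DatetimeIndex; the variable names are mine, not from pyextremes):

import pandas as pd

# pad with zeros from the start of the first year up to (but excluding)
# the first real sample, so that the first block boundary falls on 01.01
year_start = ts.index[0].replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
padding_index = pd.date_range(start=year_start, end=ts.index[0], freq="H")[:-1]
padding = pd.Series(0.0, index=padding_index, name=ts.name)
ts_padded = pd.concat([padding, ts]).sort_index()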

My desired solution:
For my purpose, it would be nice to pass date_time_intervals to pyextremes (get_extremes_block_maxima) directly. This would allow me to have hard boundaries at year transitions (see the sketch below for the kind of intervals I mean).
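For illustration, the hard-boundary intervals I have in mind could be built with plain pandas like this (the years are arbitrary, and passing such an object to pyextremes is of course only my suggestion, not an existing argument):

import pandas as pd

# calendar-year intervals with hard boundaries at 01.01,
# closed on the left so that 2021-01-01 03:00 belongs to the 2021 block
date_time_intervals = pd.interval_range(
    start=pd.Timestamp("2000-01-01"),
    end=pd.Timestamp("2022-01-01"),
    freq="YS",  # year-start frequency (use "AS" on older pandas versions)
    closed="left",
)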

I could imagine this "problem" is even more severe for monthly blocks: an average block_size would constantly traverse the hard month boundaries.

Anyway, thanks again, and I would be interested to know whether my block_size problem is worth considering.

I may be late, but if you have a custom way of extracting extreme values from your data, you can use the EVA.set_extremes method:

def set_extremes(self, extremes: pd.Series, **kwargs) -> None:
    """
    Set extreme values.

    This method is used to set extreme values onto the model instead
    of deriving them from data directly using the 'get_extremes' method.
    This way user can set extremes calculated using a custom methodology.

    Parameters
    ----------
    extremes : pd.Series
        Time series of extreme values to be set onto the model.
        Must be numeric, have date-time index, and have the same name
        as self.data.
    kwargs:
        method : str, optional
            Extreme value extraction method.
            Supported values:
                BM (default) - Block Maxima
                POT - Peaks Over Threshold
        extremes_type : str, optional
            high (default) - extreme high values
            low - extreme low values
        if method is BM:
            block_size : str or pandas.Timedelta, optional
                Block size.
                If None (default), then is calculated as median distance
                between extreme events.
            errors : str, optional
                raise - raise an exception
                    when encountering a block with no data
                ignore (default) - ignore blocks with no data
                coerce - get extreme values for blocks with no data
                    as mean of all other extreme events in the series
                    with index being the middle point of corresponding interval
            min_last_block : float, optional
                Minimum data availability ratio (0 to 1) in the last block
                for it to be used to extract extreme value from.
                This is used to discard last block when it is too short.
                If None (default), last block is always used.
        if method is POT:
            threshold : float, optional
                Threshold used to find exceedances.
                By default is taken as smallest value.
            r : pandas.Timedelta or value convertible to timedelta, optional
                Duration of window used to decluster the exceedances.
                By default r='24H' (24 hours).
                See pandas.to_timedelta for more information.
    """

This way you can extract extreme values yourself and then use them with pyextremes.
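For example, for calendar-year maxima something along these lines should work (a minimal sketch, assuming ts is your raw pandas Series with a DatetimeIndex and no NaNs; the block_size value is only illustrative):

from pyextremes import EVA

model = EVA(data=ts)

# hard year boundaries: take the maximum of each calendar year,
# keeping the timestamp at which it occurred
yearly_max_times = ts.groupby(ts.index.year).idxmax()
extremes = ts.loc[yearly_max_times.values]

model.set_extremes(extremes, method="BM", block_size="365.2425D")
model.fit_model()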

Many thanks for that hint!
Yes, that would work perfectly fine for me. Sorry, I didn't see this option!