ENH: Make it possible to give percentiles in independent file(s) instead of computing them on base_period_time_range

Question


bzah opened this issue 2 years ago · 1 comments

icclim version: 5.2
Python version: n/a

Description

To avoid computing the same percentiles multiple times, it would be nice if icclim could accept input files of percentiles.
For example, to compute Cold and Dry (needs pr and tas), the input could look like:

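import xarray as xr

# Proposed input format: one entry per variable, with its study data and optional percentiles.
in_files = {
    "tas": {"study": "tas-store.zarr", "percentiles": ["tas-per-1.nc", "tas-per-2.nc"]},
    "pr": {"study": xr.open_dataset("pr.nc").pr, "percentiles": "pr-per.nc"},
}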

Where:

in_files can be a dictionary whose keys replace var_name
when the values are not dictionaries, they must be valid inputs (netCDF(s), Zarr, xr.Dataset, xr.DataArray)
when the values are dictionaries, they must have a "study" key and may have a "percentiles" key
the values linked to "study" and "percentiles" must be valid inputs
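
The rules above could be checked with a small normalization step. The following is only a sketch: `normalize_in_files` is a hypothetical helper, not an existing icclim function, and it does not verify that the "study" and "percentiles" values are themselves readable inputs.

```python
def normalize_in_files(in_files):
    """Return {var_name: {"study": ..., "percentiles": ... or None}}.

    Hypothetical helper illustrating the proposed in_files rules.
    """
    normalized = {}
    for var_name, value in in_files.items():
        # Only plain dicts are treated as the nested form; an xr.Dataset is
        # not a dict subclass, so it falls through to the "valid input" branch.
        if isinstance(value, dict):
            if "study" not in value:
                raise ValueError(f"in_files[{var_name!r}] must have a 'study' key")
            unknown = set(value) - {"study", "percentiles"}
            if unknown:
                raise ValueError(f"in_files[{var_name!r}] has unknown keys: {sorted(unknown)}")
            normalized[var_name] = {
                "study": value["study"],
                "percentiles": value.get("percentiles"),  # optional key
            }
        else:
            # Bare value: it is the study input, and no percentiles are given.
            normalized[var_name] = {"study": value, "percentiles": None}
    return normalized
```

With this shape, downstream code could always read `entry["study"]` and `entry["percentiles"]` without caring which form the user passed.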

Additionally, with an in_files dictionary:

var_name must be empty
save_percentile cannot be used
base_period_time_range cannot be used (or maybe it should be an argument of the dictionary in place of "percentiles")
only_leap_years cannot be used (same as above)
window_width cannot be used (same as above)
interpolation cannot be used (same as above)
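
A minimal sketch of how these restrictions might be enforced follows. `find_conflicting_args` and `FORBIDDEN_WITH_DICT` are hypothetical names, and treating any falsy value (None, False, empty string) as "not set" is an assumption about the defaults.

```python
# Arguments that, per the list above, cannot be combined with an in_files dictionary.
FORBIDDEN_WITH_DICT = (
    "var_name",
    "save_percentile",
    "base_period_time_range",
    "only_leap_years",
    "window_width",
    "interpolation",
)


def find_conflicting_args(in_files, **kwargs):
    """Return the names of arguments that conflict with a dictionary in_files."""
    if not isinstance(in_files, dict):
        return []  # the restrictions only apply to the dictionary form
    # Assumption: a falsy value means the argument was left unset.
    return [name for name in FORBIDDEN_WITH_DICT if kwargs.get(name)]
```

A caller could raise an error listing the returned names, so the user sees every conflict at once rather than one per run.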

From xclim's point of view, handling a different source for percentiles is quite simple because percentile_doy is already a separate function and not part of the index call. For indices using time-series percentiles (as opposed to dayofyear percentiles), the same logic exists but with calc_perc instead.

A few notes:

With percentiles given as input, it will be harder to ensure the proper application of the ECAD definitions:

dayofyear percentiles must be used on t(g|x|n)(10|90)p indices.
timeseries percentiles must be used for r(75|95|99)p and r(75|95|99)pTOT indices.
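
One way to keep some of this safety would be to map an index name to the kind of percentiles it expects, using the two patterns above. This is a sketch: `expected_percentile_kind` is a hypothetical helper, and it only covers the indices named here (anything else returns None).

```python
import re

# Patterns taken from the two rules above; matching is case-insensitive.
DOY_INDICES = re.compile(r"t[gxn](10|90)p", re.IGNORECASE)
TIMESERIES_INDICES = re.compile(r"r(75|95|99)p(tot)?", re.IGNORECASE)


def expected_percentile_kind(index_name):
    """Return "dayofyear", "timeseries", or None if the rules do not cover the index."""
    if DOY_INDICES.fullmatch(index_name):
        return "dayofyear"
    if TIMESERIES_INDICES.fullmatch(index_name):
        return "timeseries"
    return None
```

The result could then be compared against the dimensions of the user-supplied percentiles (for example, whether they carry a dayofyear coordinate) before computing the index.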

We could use this feature to significantly improve the performance of icclim.indices("all" ...)
If we use save_percentile=True on indices sharing percentiles (like TX90p and WSDI) we can compute them once and reuse them afterward.
However, this may not work if bootstrapping must be run (see below).
With user-supplied percentiles, should we run bootstrapping or not?
For example, suppose the user computes percentiles and saves them in a file using icclim.tx90p(save_percentile=True).
The bootstrapping algorithm may have run; however, the saved percentiles are NOT the ones used in bootstrapping but only the ones used in the out-of-base comparison.
In a second step, if the user reuses these percentiles to compute either tx90p on another period or wsdi on the same period,
should we try to detect that and run bootstrapping? (ping @pagecp)
Another approach could be to extract the bootstrapped percentiles and make them reusable, which would significantly improve performance (that's not an easy task).

Notes

This issue has been raised following a Météo France meeting.
Some work is already in progress in branch enh/percentile_as_input

Answer 1 · 2022-06-09T09:38:33.000Z

Regarding bootstrapping and saving percentiles, I think we should, for now, restrict the use of saved percentiles (either saving or loading) to cases without bootstrapping.
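
A rough sketch of such a restriction is given below. It assumes that bootstrapping is needed whenever the base period and the study period overlap, and that periods are given as inclusive (start_year, end_year) tuples; both function names are hypothetical.

```python
def periods_overlap(base_period, study_period):
    """Return True if two inclusive (start_year, end_year) ranges share a year."""
    base_start, base_end = base_period
    study_start, study_end = study_period
    return base_start <= study_end and study_start <= base_end


def can_use_saved_percentiles(base_period, study_period):
    """Allow saving or loading percentiles only when no bootstrapping is expected.

    Assumption: bootstrapping runs whenever the two periods overlap.
    """
    return not periods_overlap(base_period, study_period)
```

For instance, a 1961-1990 base period with a 1991-2020 study period would be allowed, while a study period starting in 1980 would be rejected.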