Incorporate CRPS explanation into documentation
lboeman opened this issue · 0 comments
During the American Made Challenges Solar Forecasting Competition, users had questions about how we handle CRPS using our descrete quantiles. @dplarson provided the following helpful explanation which should be incorporated into the documentation:
The Solar Forecast Arbiter (SFA) uses numerical integration to approximate the integral term in the CRPS definition but does not interpolate the forecast CDF (since the SFA makes no assumptions regarding how the forecasts were generated). There are tradeoffs with this approach, but that is true of any CRPS formulation where you don’t assume the forecast is defined by a specific continuous parametric function (and therefore must approximate the CRPS numerically). The SFA documentation (https://solarforecastarbiter-core.readthedocs.io/en/latest/generated/solarforecastarbiter.metrics.probabilistic.continuous_ranked_probability_score.html) and source code (https://solarforecastarbiter-core.readthedocs.io/en/latest/_modules/solarforecastarbiter/metrics/probabilistic.html#continuous_ranked_probability_score) have more details.
Working through a CRPS calculation by hand is tedious, but an example to illustrate: Let’s assume the target variable is GHI [W/m^2] and the probabilistic forecast is given as:
Prob(Power <= 42 W/m^2) = 0%
Prob(Power <= 408 W/m^2) = 10%
Prob(Power <= 474 W/m^2) = 20%
Prob(Power <= 521 W/m^2) = 30%
Prob(Power <= 562 W/m^2) = 40%
Prob(Power <= 600 W/m^2) = 50%
Prob(Power <= 638 W/m^2) = 60%
Prob(Power <= 679 W/m^2) = 70%
Prob(Power <= 726 W/m^2) = 80%
Prob(Power <= 792 W/m^2) = 90%
Prob(Power <= 1158 W/m^2) = 100%
which is approximately the CDF of a Gaussian (aka normal) distribution with mean = 600 W/m^2 and standard deviation = 150 W/m^2.
The steps to calculate the CRPS for a single forecast timestamp is:
calculate the indicator function term: 1{x >= y}
= 1 if x >= y, else 0 (where y is the observation)
calculate the integrand term: ( F(x) – 1{x >= y} )^2
integrate along x
If the observation is 710 W/m^2, then in vector form we have:
x = [42, 408, 474, 521, 562, 600, 638, 679, 726, 792, 1158] # the forecasted GHI [W/m^2]
F(x) = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] # the forecasted probability [-]
y = 710 # the observed GHI [W/m^2]
Then the calculation is:
indicator function term: 1{x >= y}
= [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1] # only {726, 792, 1158} are >= 710 W/m^2 (the {80%, 90%, 100%} forecasts)
the difference between the forecast probability and the indicator function: F(x) – 1{x >= y}
= [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, -0.2, -0.1, 0.0]
the integrand term: ( F(x) – 1{x >= y} )^2
= [0.0, 0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.49, 0.04, 0.01, 0.0]
integrate along x (using trapezoid rule): CRPS = 64.4 W/m^2
You can follow the same approach for other observations:
same forecasts, but the observation is 450 W/m^2: CRPS = 98.5 W/m^2
same forecasts, but the observation is 1000 W/m^2: CRPS = 271.1 W/m^2
Note: in this example, we know the forecast is from a Gaussian distribution and the CRPS of a Gaussian can be calculated analytically [1]. If the observation is 710 W/m^2, then the analytical calculation of the CRPS is 65.9 W/m^2 vs the SFA CRPS of 64.4 W/m^2. The difference here is due to the limits of numerical integration. For example, increasing the probability resolution results in the SFA CRPS converging to the analytically calculated CRPS (65.9 W/m^2):
forecast every 10% (i.e., 0%, 10%, …, 90%, 100%): CRPS = 64.4 W/m^2
forecast every 1% (i.e., 0%, 1%, …, 99%, 100%): CRPS = 65.0 W/m^2
forecast every 0.1% (i.e., 0%, 0.1%, …, 99.9%, 100%): CRPS = 65.9 W/m^2
[1] Gneiting et al. (2005) “Calibrated Probabilistic Forecasting Using Ensemble Model Output Statistics and Minimum CRPS Estimation”, doi: https://doi.org/10.1175/MWR2904.1