stan-dev/loo

loo_compare for crps and loo_crps

bnicenboim opened this issue · 10 comments

I just discovered crps! This is great: I was actually asking about how to compare probabilistic vs non-probabilistic predictions (https://discourse.mc-stan.org/t/comparison-of-machine-learning-models-and-bayesian-models/30833), and this is exactly what I needed.

I just noticed that loo_compare and print are not implemented for it. If it's OK, I volunteer to implement them. I just want to know what you would call the predictions; "elpd" doesn't make much sense. Maybe "ps" (predictive score)?
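Roughly, I'm imagining something like this (just a sketch, all names provisional; it assumes crps() output would get a "crps" class and keep its $estimates and $pointwise elements):

# sketch of a possible print method (hypothetical)
print.crps <- function(x, digits = 1, ...) {
  cat("Computed from", length(x$pointwise), "pointwise values\n\n")
  print(round(x$estimates, digits))
  invisible(x)
}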

Seems like a good idea, thanks @bnicenboim. @avehtari what do you think?

Maybe "ps"? (predictive score)?

Maybe just "score"?

The problem is that the Brier score is kind of related and a higher score indicates a worse prediction. I think it's important to indicate that this is a predictive score.

Good point. Let's see if @avehtari or any other loo developers (or users) have an opinion on the name.

I just want to know what you would call the predictions; "elpd" doesn't make much sense

Can you be more specific? Are you talking about what is printed, or about the names in the data structure storing the needed information? elpd is computed using predictions and data, so it's not a prediction either.
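(For reference, elpd_loo = sum_i log p(y_i | y_-i), i.e. the sum of the log predictive densities evaluated at the observed data, so each pointwise term combines the predictive distribution with the observed y_i.)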

We also have loo_predictive_metric(), so it would be useful to think about that at the same time. ((s)crps have their own functions, as they need the extra argument.)

When printing in loo_compare, it would be good to show e.g. crps_diff, rmse_diff, etc.
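For example, something along these lines (purely illustrative numbers; crps1 and crps2 stand for crps results from two models, and a loo_compare method for them is assumed):

loo_compare(crps1, crps2)
#>        crps_diff se_diff
#> model2       0.0     0.0
#> model1      -1.3     0.6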

The problem is that the Brier score is kind of related and a higher score indicates a worse prediction. I think it's important to indicate that this is a predictive score.

Yeah, it's annoying that Brier is in that direction, but Brier score can be a predictive score, too.
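For concreteness (a toy example, not loo code): the Brier score for binary outcomes is the mean squared difference between the predicted probabilities and the observed outcomes, so lower is better:

# toy Brier score for binary outcomes (lower is better)
brier <- function(p, y) mean((p - y)^2)
brier(c(0.9, 0.2, 0.7), c(1, 0, 1))
#> [1] 0.04666667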

I just want to know what you would call the predictions; "elpd" doesn't make much sense

Can you be more specific? Are you talking about what is printed, or about the names in the data structure storing the needed information? elpd is computed using predictions and data, so it's not a prediction either.

Sorry, I was being too sloppy. I mean the following: this is the output when you print a loo object, and the structure of the object:

suppressMessages(library(loo))
suppressMessages(library(rstanarm))
data("kidiq")
fit <- stan_glm(kid_score ~ mom_hs + mom_iq, data = kidiq, refresh=0)
(l <- loo(fit))
#> 
#> Computed from 4000 by 434 log-likelihood matrix
#> 
#>          Estimate   SE
#> elpd_loo  -1876.2 14.2
#> p_loo         4.1  0.4
#> looic      3752.3 28.4
#> ------
#> Monte Carlo SE of elpd_loo is 0.0.
#> 
#> All Pareto k estimates are good (k < 0.5).
#> See help('pareto-k-diagnostic') for details.
l$pointwise |> head()
#>       elpd_loo mcse_elpd_loo       p_loo     looic influence_pareto_k
#> [1,] -5.696742  0.0026537646 0.034212237 11.393485         0.10213248
#> [2,] -4.208480  0.0009781805 0.004128595  8.416959        -0.02607779
#> [3,] -4.032512  0.0007344157 0.002412484  8.065024         0.02448749
#> [4,] -3.855000  0.0005129338 0.001242125  7.710000        -0.09151790
#> [5,] -5.290719  0.0019263523 0.016870922 10.581438         0.09749258
#> [6,] -4.027275  0.0012510816 0.006134031  8.054551         0.05941903

Created on 2023-04-04 with reprex v2.0.2

And this is for crps:

suppressMessages(library(loo))
suppressMessages(library(rstanarm))
data("kidiq")
fit <- stan_glm(kid_score ~ mom_hs + mom_iq, data = kidiq, refresh=0)
ypred1 <- posterior_predict(fit)
ypred2 <- posterior_predict(fit)
(ps <- crps(ypred1, ypred2, y = fit$y))
#> $estimates
#>    Estimate          SE 
#> -10.2131439   0.3440731 
#> 
#> $pointwise
#>          1          2          3          4          5          6          7 
#> -25.441469  -9.469547  -7.159489  -4.931622 -21.550196  -7.035573 -30.444535 
#>          8          9         10         11         12         13         14 
#>  ...

I meant how to organize $pointwise as a matrix and what to call the estimates: maybe ps, or score (I don't think that's a good idea), or just crps. I vote for ps, because then if someone implements rps (I could do it), it will have the same name. Then loo_compare could show ps_diff.
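Concretely, something like this (a sketch, using the ps object from the reprex above):

# reorganize $pointwise as a named one-column matrix, mirroring the
# layout of loo objects, so print/loo_compare methods can be shared
ps$pointwise <- matrix(ps$pointwise, ncol = 1,
                       dimnames = list(NULL, "ps"))
head(ps$pointwise)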

We also have loo_predictive_metric(), so it would be useful to think about that at the same time. ((s)crps have their own functions, as they need the extra argument.)

I also asked about this in #221: why do they need an extra argument? One could split the predictions array and use half for the first argument and half for the second one:

If you just divide the predictions in two, the estimates are virtually the same:

# split the posterior predictive draws in half instead of calling
# posterior_predict() twice
(ps2 <- crps(ypred1[1:2000, ], ypred1[2001:4000, ], y = fit$y))

I think ps is too short. Why not use the actual acronym for each: scrps, crps, rps? If it just says ps, how would you know which has been used? Furthermore, for consistency, there should be a _loo ending. Like this:

#> 
#> Computed from 4000 by 434 log-likelihood matrix
#> 
#>          Estimate   SE
#> scrps_loo  -1876.2 14.2
#> ------
#> Monte Carlo SE of scrps_loo is 0.0.
#> 
#> All Pareto k estimates are good (k < 0.5).
#> See help('pareto-k-diagnostic') for details.

#>      scrps_loo mcse_scrps_loo influence_pareto_k
#> [1,] -5.696742   0.0026537646         0.10213248
#> [2,] -4.208480   0.0009781805        -0.02607779
#> [3,] -4.032512   0.0007344157         0.02448749
#> [4,] -3.855000   0.0005129338        -0.09151790
#> [5,] -5.290719   0.0019263523         0.09749258
#> [6,] -4.027275   0.0012510816         0.05941903

This would also allow that matrix to include both scrps_loo and crps_loo columns.
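E.g. (hypothetical; scrps_pw, crps_pw, and pareto_k stand in for the corresponding pointwise vectors):

# hypothetical pointwise matrix holding several scores side by side
pointwise <- cbind(scrps_loo = scrps_pw,
                   crps_loo = crps_pw,
                   influence_pareto_k = pareto_k)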

I also asked about this in #221: why do they need an extra argument? One could split the predictions array and use half for the first argument and half for the second one:

If you just divide the predictions in two, the estimates are virtually the same:

I think it would be clearer to have this discussed in a separate issue (and tag @LeeviLindgren). I don't expect there is much difference in the easy case, but if the variability of the importance weights is big, then a) the importance weights are different for the two halves (while in the current implementation they are the same), and b) the number of draws for the importance weighting is halved, which increases the Monte Carlo error. It is possible that in practice a + b are not a problem, but this should be investigated more thoroughly.
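As a first step, one could look at the Monte Carlo variability of the split-half estimate (a sketch reusing ypred1 and fit from the reprex above; it only probes b) for the plain crps(), not the importance-weighting issue in a)):

# repeat the split-half estimate over random splits of the draws and
# look at its spread, compared with the SE from the two-matrix call
set.seed(1)
ests <- replicate(10, {
  idx <- sample(4000, 2000)
  crps(ypred1[idx, ], ypred1[-idx, ], y = fit$y)$estimates["Estimate"]
})
sd(ests)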

I think ps is too short. Why not use the actual acronym for each: scrps, crps, rps? If it just says ps, how would you know which has been used?

OK, sure, that makes sense.

Furthermore, for consistency, there should be a _loo ending.

But what if one uses crps? It won't (necessarily) be leave-one-out; it could be k-fold or held-out data.

And also this: "Computed from 4000 by 434 log-likelihood matrix" is only relevant for loo_crps/loo_scrps, right? Should I just leave it empty in the other cases?

But what if one uses crps? It won't (necessarily) be leave-one-out; it could be k-fold or held-out data.

For k-fold and held-out data, we have https://mc-stan.org/loo/articles/loo2-elpd.html, and we could follow that or do a complete redesign.

And also this: "Computed from 4000 by 434 log-likelihood matrix" is only relevant for loo_crps/loo_scrps, right? Should I just leave it empty in the other cases?

For non-loo, it could still report the sizes of yrep and y.
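E.g. something like (hypothetical wording):

#> Computed from a 4000 by 434 yrep matrix and y of length 434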

@bnicenboim should we have a video call to discuss this?

I'm sorry, as I mentioned in the other issue, I won't have time until mid-January (hopefully).