tidymodels/spatialsample

There should be a standard way to set `v` to `max_v`

Closed this issue · 4 comments

Feature

spatialsample uses a different `check_v` than rsample because setting `v` to `max_v` often makes statistical sense (for leave-one-block-out CV, leave-one-location-out CV, and so on). However, it can be hard to know in advance what `v` should be in these cases; for instance, it's hard to tell how many blocks will contain data before actually running `spatial_block_cv()`. Right now, my general approach is to set `v = Inf`, which works but emits a warning:

library(spatialsample)

spatial_block_cv(boston_canopy, v = Inf)
#> Warning in spatial_block_cv(boston_canopy, v = Inf): Fewer than Inf blocks available for sampling
#> ℹ Setting `v` to 61
#> #  61-fold spatial block cross-validation 
#> # A tibble: 61 × 2
#>    splits           id    
#>    <list>           <chr> 
#>  1 <split [665/17]> Fold01
#>  2 <split [679/3]>  Fold02
#>  3 <split [666/16]> Fold03
#>  4 <split [665/17]> Fold04
#>  5 <split [674/8]>  Fold05
#>  6 <split [671/11]> Fold06
#>  7 <split [676/6]>  Fold07
#>  8 <split [679/3]>  Fold08
#>  9 <split [665/17]> Fold09
#> 10 <split [671/11]> Fold10
#> # … with 51 more rows

Created on 2022-06-22 by the reprex package (v2.0.1)

I think this makes good, intuitive sense and gives the expected result: we ask for more folds than the maximum allowed and get `max_v` instead, and setting `v` to infinity guarantees the request exceeds that maximum. I think we should skip the warning when `is.infinite(v)` and document this as the supported way to perform leave-one-something-out CV.

An alternative would be to use `NULL` for this special case. That's how rsample handles it in `group_vfold_cv()`, and `NULL` might be a more conventional "special case" value than `Inf`, which I bet a good chunk of users are barely aware of in the first place. Right now this errors:

library(spatialsample)

spatial_block_cv(boston_canopy, v = NULL)
#> Error in `spatial_block_cv()`:
#> ! `v` must be a single positive integer.

Created on 2022-06-22 by the reprex package (v2.0.1)

NULL feels less graceful to me for this purpose than Inf, but might be more idiomatic.

We also special-case NULL already for spatial_leave_location_out_cv, because it just wraps group_vfold_cv and so does what rsample does.

Another possibility is to special-case both NULL and Inf, which I think I'd personally prefer over just NULL (just because Inf feels so graceful). Not a strong preference, but definitely a preference.
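To make the "special-case both" option concrete, here is a minimal sketch in base R. `resolve_v()` is a hypothetical helper written for illustration; it is not spatialsample's actual `check_v`, and the messages only loosely mirror the ones shown above. It assumes `max_v` has already been computed by the caller.

```r
# Hypothetical helper, NOT spatialsample's actual check_v():
# resolve the requested number of folds against the maximum available.
resolve_v <- function(v, max_v) {
  # NULL and Inf both mean "use every available block or location".
  if (is.null(v) ||
      (is.numeric(v) && length(v) == 1 && is.infinite(v) && v > 0)) {
    return(as.integer(max_v))
  }
  # Anything else must be a single positive whole number.
  if (!is.numeric(v) || length(v) != 1 || is.na(v) ||
      v < 1 || v != round(v)) {
    stop("`v` must be a single positive integer.", call. = FALSE)
  }
  # A finite v above the maximum is still clamped, with a warning.
  if (v > max_v) {
    warning(
      "Fewer than ", v, " blocks available for sampling; ",
      "setting `v` to ", max_v,
      call. = FALSE
    )
    return(as.integer(max_v))
  }
  as.integer(v)
}

resolve_v(Inf, max_v = 61)   # 61, no warning
resolve_v(NULL, max_v = 61)  # 61, no warning
resolve_v(10, max_v = 61)    # 10
resolve_v(100, max_v = 61)   # 61, with a warning
```

In this sketch a finite `v` that is too large still warns, so only the two explicit "give me the maximum" values are silent.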

I like your idea of special-casing both `NULL` and `Inf`. (I do think that `NULL` will be more familiar to folks.) I'd also go through the examples and, if the time constraints on the check allow it, show this usage there.
