There should be a standard way to set `v` to `max_v`
Closed this issue · 4 comments
Feature
spatialsample uses a different check_v than rsample, because it's often the case that setting v to max_v makes statistical sense (in situations like leave-one-block-out CV, leave-one-location-out CV, and so on). However, it can be difficult to know in advance what v
should be for these circumstances, for instance to know how many blocks will contain data in the process of running spatial_block_cv. Right now, my general approach is to set v = Inf
, which works with a warning:
library(spatialsample)
spatial_block_cv(boston_canopy, v = Inf)
#> Warning in spatial_block_cv(boston_canopy, v = Inf): Fewer than Inf blocks available for sampling
#> ℹ Setting `v` to 61
#> # 61-fold spatial block cross-validation
#> # A tibble: 61 × 2
#> splits id
#> <list> <chr>
#> 1 <split [665/17]> Fold01
#> 2 <split [679/3]> Fold02
#> 3 <split [666/16]> Fold03
#> 4 <split [665/17]> Fold04
#> 5 <split [674/8]> Fold05
#> 6 <split [671/11]> Fold06
#> 7 <split [676/6]> Fold07
#> 8 <split [679/3]> Fold08
#> 9 <split [665/17]> Fold09
#> 10 <split [671/11]> Fold10
#> # … with 51 more rows
Created on 2022-06-22 by the reprex package (v2.0.1)
I think this makes good, intuitive sense, and gives the expected result: we're setting v
higher than the maximum acceptable number of folds and getting max_v
instead, and by setting v
to infinity we can be sure it's higher than that maximum. I think we shouldn't warn when is.infinite(v)
and document this as the supported way to perform leave-one-something-out CV.
An alternative would be to use NULL
for this special case, which is how rsample does it in group vfold and might be a more normal "special case" value than Inf, which I bet a good chunk of users are barely aware of in the first place. Right now this errors:
library(spatialsample)
spatial_block_cv(boston_canopy, v = NULL)
#> Error in `spatial_block_cv()`:
#> ! `v` must be a single positive integer.
Created on 2022-06-22 by the reprex package (v2.0.1)
NULL
feels less graceful to me for this purpose than Inf
, but might be more idiomatic.
We also special-case NULL
already for spatial_leave_location_out_cv
, because it just wraps group_vfold_cv
and so does what rsample does.
Another possibility is to special-case both NULL
and Inf
, which I think I'd personally prefer over just NULL
(just because Inf
feels so graceful). Not a strong preference, but definitely a preference.
I like your idea of special-casing both NULL
and Inf
. (I do think that NULL
will be more familiar to folks.) I'd also do a check through the examples and (if possible due to time constraints on the check) show using this.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.