tidymodels/spatialsample

Error while executing a random forest regression using spatial sampling

Closed this issue · 5 comments

I want to perform a random forest (RF) regression to predict a continuous outcome. I'm using spatialsample to partition the data set (I am aware of the issue #140). Whenever I try to perform the regression (using default parameters for the RF) I am getting this error:

Error in `vctrs::vec_slice()`:
! Can't subset elements with `i`.
✖ Logical subscript `i` must be size 1 or 6, not 7.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
In model_env_matches$mode == mode :
  longer object length is not a multiple of shorter object length

I am not sure what the error means and how to solve it. Below is the code:

library(ggplot2)
library(spatialsample)
library(tidymodels)
library(textrecipes)

wd <- "path/"

# Projected reference system
provoliko <- "EPSG:24313"

df <- read.csv(paste0(wd, 'block.data.csv'))

eq1 <- ntl ~ pop + agbh + nir + ebbi + ndbi + road + pan + tirs

ames_sf <- sf::st_as_sf(df, coords = c("x", "y"), crs = provoliko)

set.seed(1234)
folds <- spatial_block_cv(ames_sf, v = 20)

recipe <- recipes::recipe(eq1, data = sf::st_drop_geometry(ames_sf))

# modeling
drought_result <- workflow(recipe %>%  
                             rand_forest() %>%
                             set_mode("regression") %>%
                             set_engine("randomForest") %>% 
                             tune::fit_resamples(folds, 
                                                 control = control_resamples(save_pred = TRUE))) # the error

Here is a subset of the sf object I am using:

A tibble: 10 × 11
     ntl   pop     agbh   nir     ebbi     ndbi  road    pan    nbai  tirs           geometry
   <dbl> <dbl>    <dbl> <dbl>    <dbl>    <dbl> <dbl>  <dbl>   <dbl> <dbl>        <POINT [m]>
 1  5.87  17.0 0.00230  0.300 -0.201   -0.243    48.1 0.108  -0.204   33.1 (435310.3 3470977)
 2 17.8   77.8 0.729    0.242  0.0585   0.0630  213.  0.172  -0.0423  40.3 (446230.3 3482737)
 3  7.06  34.6 0.0128   0.256 -0.152   -0.197     0   0.116  -0.164   34.3 (443710.3 3498277)
 4 29.9  136.  3.33     0.148  0.0150   0.0271  233.  0.123  -0.0427  40.4 (440350.3 3497017)
 5 12.0   33.8 1.01     0.244  0.00413  0.00466  23.1 0.152  -0.0747  37.1 (432370.3 3470557)
 6 40.4   86.4 4.26     0.168  0.00922  0.0167  177.  0.129  -0.0528  38.1 (437830.3 3493657)
 7  2.58  18.2 0.0795   0.265 -0.160   -0.212     0   0.110  -0.180   31.9 (452530.3 3497857)
 8  2.00  14.9 0.000419 0.332 -0.287   -0.333    53.3 0.0959 -0.264   27.1 (453790.3 3477697)
 9  6.74  20.8 0.0481   0.256  0.0657   0.0669   77.5 0.161  -0.0637  40.2 (451690.3 3480637)
10 26.8   81.4 1.87     0.238 -0.0379  -0.0415  142.  0.127  -0.115   38.3 (438250.3 3482737)

R 4.3.1 RStudio 2023.09.0 Build 463, Windows 10.

Howdy @nikosGeography !

Can you try building your folds with a regular data frame, instead of an sf object? For example:

folds <- spatial_block_cv(as.data.frame(sf::st_drop_geometry(ames_sf)), v = 20)

I'm pretty sure that should work. If it doesn't, can you upload (some of?) your data to this issue so I can take a look tomorrow? I don't want to type that data frame out 😆 your dput() output from last time was also useful, if that's easier.

It didn't work. I received an error:

Error in `spatial_block_cv()`:
! `spatial_block_cv()` currently only supports `sf` objects.
ℹ Try converting `data` to an `sf` object via `sf::st_as_sf()`.

If you could take a look tomorrow, that would be nice. Here is a data.frame with 10 rows:

structure(list(ntl = c(14.118185043335, 15.0372104644775, 24.6128005981445, 
33.2751502990723, 3.09464168548584, 24.6151103973389, 11.3328771591187, 
25.8314075469971, 26.6275043487549, 50.9468421936035), pop = c(87.3973999023438, 
89.5049591064453, 69.193359375, 113.711647033691, 10.6829500198364, 
91.0826187133789, 76.8622131347656, 86.4574127197266, 97.2555160522461, 
144.035659790039), agbh = c(0.505127429962158, 3.07101631164551, 
3.57691407203674, 2.53673243522644, 0.00112199422437698, 5.68933582305908, 
0.250360578298569, 3.60711622238159, 4.05604648590088, 4.6323823928833
), nir = c(0.229171246290207, 0.194354802370071, 0.17390401661396, 
0.229749098420143, 0.272861808538437, 0.181734174489975, 0.257997244596481, 
0.195779532194138, 0.195880636572838, 0.191920325160027), ebbi = c(-0.0958575084805489, 
-0.0110349524766207, 0.00237648747861385, -0.0257017724215984, 
-0.0904427915811539, 0.0401351079344749, -0.145487412810326, 
-0.00527992565184832, 0.013465441763401, 0.030767060816288), 
    ndbi = c(-0.133083552122116, -0.0179156363010406, 0.00375741114839911, 
    -0.0328953787684441, -0.115226343274117, 0.0590021200478077, 
    -0.184676945209503, -0.00770473340526223, 0.0193305220454931, 
    0.045164693146944), road = c(251.6875, 83.7782592773438, 
    487.718627929688, 125.576850891113, 14.2957067489624, 540.215454101562, 
    6.76938247680664, 463.005859375, 326.509704589844, 399.547058105469
    ), pan = c(0.114391185343266, 0.131493717432022, 0.133256837725639, 
    0.135219603776932, 0.131857812404633, 0.150442510843277, 
    0.11356084048748, 0.136978164315224, 0.143271997570992, 0.150269523262978
    ), nbai = c(-0.140348717570305, -0.0788049399852753, -0.0648676380515099, 
    -0.098626047372818, -0.122976809740067, -0.0357712768018246, 
    -0.161113634705544, -0.0757625624537468, -0.0654538571834564, 
    -0.0543252900242805), tirs = c(34.5458335876465, 35.3345603942871, 
    36.5489959716797, 36.8556747436523, 35.1384429931641, 36.5604209899902, 
    33.3936080932617, 36.6631050109863, 37.0490875244141, 37.2179985046387
    ), geometry = structure(list(structure(c(446230.3092, 3490716.9101
    ), class = c("XY", "POINT", "sfg")), structure(c(448750.3092, 
    3493656.9101), class = c("XY", "POINT", "sfg")), structure(c(434470.3092, 
    3479796.9101), class = c("XY", "POINT", "sfg")), structure(c(440770.3092, 
    3491136.9101), class = c("XY", "POINT", "sfg")), structure(c(436570.3092, 
    3470556.9101), class = c("XY", "POINT", "sfg")), structure(c(439510.3092, 
    3478536.9101), class = c("XY", "POINT", "sfg")), structure(c(426490.3092, 
    3482736.9101), class = c("XY", "POINT", "sfg")), structure(c(429430.3092, 
    3477276.9101), class = c("XY", "POINT", "sfg")), structure(c(434890.3092, 
    3479376.9101), class = c("XY", "POINT", "sfg")), structure(c(430270.3092, 
    3481476.9101), class = c("XY", "POINT", "sfg"))), class = c("sfc_POINT", 
    "sfc"), precision = 0, bbox = structure(c(xmin = 426490.3092, 
    ymin = 3470556.9101, xmax = 448750.3092, ymax = 3493656.9101
    ), class = "bbox"), crs = structure(list(input = "EPSG:24313", 
        wkt = "PROJCRS[\"Kalianpur 1962 / UTM zone 43N\",\n    BASEGEOGCRS[\"Kalianpur 1962\",\n        DATUM[\"Kalianpur 1962\",\n            ELLIPSOID[\"Everest 1830 (1962 Definition)\",6377301.243,300.8017255,\n                LENGTHUNIT[\"metre\",1]]],\n        PRIMEM[\"Greenwich\",0,\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n        ID[\"EPSG\",4145]],\n    CONVERSION[\"UTM zone 43N\",\n        METHOD[\"Transverse Mercator\",\n            ID[\"EPSG\",9807]],\n        PARAMETER[\"Latitude of natural origin\",0,\n            ANGLEUNIT[\"degree\",0.0174532925199433],\n            ID[\"EPSG\",8801]],\n        PARAMETER[\"Longitude of natural origin\",75,\n            ANGLEUNIT[\"degree\",0.0174532925199433],\n            ID[\"EPSG\",8802]],\n        PARAMETER[\"Scale factor at natural origin\",0.9996,\n            SCALEUNIT[\"unity\",1],\n            ID[\"EPSG\",8805]],\n        PARAMETER[\"False easting\",500000,\n            LENGTHUNIT[\"metre\",1],\n            ID[\"EPSG\",8806]],\n        PARAMETER[\"False northing\",0,\n            LENGTHUNIT[\"metre\",1],\n            ID[\"EPSG\",8807]]],\n    CS[Cartesian,2],\n        AXIS[\"(E)\",east,\n            ORDER[1],\n            LENGTHUNIT[\"metre\",1]],\n        AXIS[\"(N)\",north,\n            ORDER[2],\n            LENGTHUNIT[\"metre\",1]],\n    USAGE[\n        SCOPE[\"Engineering survey, topographic mapping.\"],\n        AREA[\"Pakistan - east of 72°E.\"],\n        BBOX[28.21,72,37.07,77.83]],\n    ID[\"EPSG\",24313]]"), class = "crs"), n_empty = 0L)), row.names = c(NA, 
10L), sf_column = "geometry", agr = structure(c(ntl = NA_integer_, 
pop = NA_integer_, agbh = NA_integer_, nir = NA_integer_, ebbi = NA_integer_, 
ndbi = NA_integer_, road = NA_integer_, pan = NA_integer_, nbai = NA_integer_, 
tirs = NA_integer_), class = "factor", levels = c("constant", 
"aggregate", "identity")), class = c("sf", "data.frame"))

So the core issue here is in how you've got that workflow() call set up. You might find https://www.tidymodels.org/start/tuning/ useful for how to work with this function and package.

The first issue is that rand_forest()'s first argument is mode, which takes a character string specifying whether you're doing regression or classification. It doesn't take a recipe, and so piping your recipe into the random forest errors:

library(tidymodels)
recipe <- recipes::recipe(Sale_Price ~ ., modeldata::ames)
recipe %>%  
  rand_forest()
#> Warning in model_env_matches$mode == mode: longer object length is not a
#> multiple of shorter object length
#> Error in `vctrs::vec_slice()`:
#> ! Can't subset elements with `i`.
#> ✖ Logical subscript `i` must be size 1 or 6, not 7.
#> Backtrace:
#>     ▆
#>  1. ├─recipe %>% rand_forest()
#>  2. ├─parsnip::rand_forest(.)
#>  3. │ └─parsnip::new_model_spec(...)
#>  4. │   └─parsnip::spec_is_possible(spec = out)
#>  5. │     └─vctrs::vec_slice(...)
#>  6. └─vctrs (local) `<fn>`()
#>  7.   └─vctrs:::stop_indicator_size(...)
#>  8.     └─rlang::cnd_signal(...)

Created on 2023-10-04 with reprex v2.0.2

The next few pipes are fine -- you can set up a model specification using rand_forest() %>% set_mode() %>% set_engine() without an error:

library(tidymodels)
rand_forest() |> 
  set_mode("regression") |> 
  set_engine("randomForest")
#> Random Forest Model Specification (regression)
#> 
#> Computational engine: randomForest

Created on 2023-10-04 with reprex v2.0.2

But then the arguments to fit_resamples() depend upon if you want to use a workflow or a normal model specification. Check ?fit_resamples for more information, but basically -- when you pipe the model spec into fit_resamples(), the second argument needs to be your recipe. If you want to use a workflow, you'll set the recipe earlier instead.

So, all that said, here's 2 equivalent methods for how you can use fit_resamples() here:

library(tidymodels)
recipe <- recipes::recipe(Sale_Price ~ ., data = modeldata::ames)
folds <- rsample::vfold_cv(modeldata::ames, v = 2)
rf_spec <- rand_forest() |> 
  set_mode("regression") |> 
  set_engine("ranger") # just to be faster

# These are ~equivalent:
rf_spec |> 
  tune::fit_resamples(recipe, folds)
#> # Resampling results
#> # 2-fold cross-validation 
#> # A tibble: 2 × 4
#>   splits              id    .metrics         .notes          
#>   <list>              <chr> <list>           <list>          
#> 1 <split [1465/1465]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
#> 2 <split [1465/1465]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>

workflow(recipe, rf_spec) |> 
  tune::fit_resamples(folds)
#> # Resampling results
#> # 2-fold cross-validation 
#> # A tibble: 2 × 4
#>   splits              id    .metrics         .notes          
#>   <list>              <chr> <list>           <list>          
#> 1 <split [1465/1465]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
#> 2 <split [1465/1465]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>

Created on 2023-10-04 with reprex v2.0.2

And I've tested and it works with your data (after using sf::st_drop_geometry() when making the recipe -- sorry, this is a known unfriendly behavior in recipes & tune) as well 😄

Thank you for the detailed explanation.

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.