Error while executing a random forest regression using spatial sampling
Closed this issue · 5 comments
I want to perform a random forest (RF) regression to predict a continuous outcome. I'm using spatialsample
to partition the data set (I am aware of the issue #140). Whenever I try to perform the regression (using default parameters for the RF) I am getting this error:
Error in `vctrs::vec_slice()`:
! Can't subset elements with `i`.
✖ Logical subscript `i` must be size 1 or 6, not 7.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
In model_env_matches$mode == mode :
longer object length is not a multiple of shorter object length
I am not sure what the error means and how to solve it. Below is the code:
library(ggplot2)
library(spatialsample)
library(tidymodels)
library(textrecipes)
wd <- "path/"
# Projected reference system
provoliko <- "EPSG:24313"
df <- read.csv(paste0(wd, 'block.data.csv'))
eq1 <- ntl ~ pop + agbh + nir + ebbi + ndbi + road + pan + tirs
ames_sf <- sf::st_as_sf(df, coords = c("x", "y"), crs = provoliko)
set.seed(1234)
folds <- spatial_block_cv(ames_sf, v = 20)
recipe <- recipes::recipe(eq1, data = sf::st_drop_geometry(ames_sf))
# modeling
drought_result <- workflow(recipe %>%
rand_forest() %>%
set_mode("regression") %>%
set_engine("randomForest") %>%
tune::fit_resamples(folds,
control = control_resamples(save_pred = TRUE))) # the error
Here is a subset of the sf
object I am using:
A tibble: 10 × 11
ntl pop agbh nir ebbi ndbi road pan nbai tirs geometry
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <POINT [m]>
1 5.87 17.0 0.00230 0.300 -0.201 -0.243 48.1 0.108 -0.204 33.1 (435310.3 3470977)
2 17.8 77.8 0.729 0.242 0.0585 0.0630 213. 0.172 -0.0423 40.3 (446230.3 3482737)
3 7.06 34.6 0.0128 0.256 -0.152 -0.197 0 0.116 -0.164 34.3 (443710.3 3498277)
4 29.9 136. 3.33 0.148 0.0150 0.0271 233. 0.123 -0.0427 40.4 (440350.3 3497017)
5 12.0 33.8 1.01 0.244 0.00413 0.00466 23.1 0.152 -0.0747 37.1 (432370.3 3470557)
6 40.4 86.4 4.26 0.168 0.00922 0.0167 177. 0.129 -0.0528 38.1 (437830.3 3493657)
7 2.58 18.2 0.0795 0.265 -0.160 -0.212 0 0.110 -0.180 31.9 (452530.3 3497857)
8 2.00 14.9 0.000419 0.332 -0.287 -0.333 53.3 0.0959 -0.264 27.1 (453790.3 3477697)
9 6.74 20.8 0.0481 0.256 0.0657 0.0669 77.5 0.161 -0.0637 40.2 (451690.3 3480637)
10 26.8 81.4 1.87 0.238 -0.0379 -0.0415 142. 0.127 -0.115 38.3 (438250.3 3482737)
R
4.3.1 RStudio
2023.09.0 Build 463, Windows 10.
Howdy @nikosGeography !
Can you try building your folds with a regular data frame, instead of an sf object? For example:
folds <- spatial_block_cv(as.data.frame(sf::st_drop_geometry(ames_sf)), v = 20)
I'm pretty sure that should work. If it doesn't, can you upload (some of?) your data to this issue so I can take a look tomorrow? I don't want to type that data frame out 😆 your dput()
output from last time was also useful, if that's easier.
It didn't work. I received an error:
Error in `spatial_block_cv()`:
! `spatial_block_cv()` currently only supports `sf` objects.
ℹ Try converting `data` to an `sf` object via `sf::st_as_sf()`.
If you could take a look tomorrow, that would be nice. Here is a data.frame
with 10 rows:
structure(list(ntl = c(14.118185043335, 15.0372104644775, 24.6128005981445,
33.2751502990723, 3.09464168548584, 24.6151103973389, 11.3328771591187,
25.8314075469971, 26.6275043487549, 50.9468421936035), pop = c(87.3973999023438,
89.5049591064453, 69.193359375, 113.711647033691, 10.6829500198364,
91.0826187133789, 76.8622131347656, 86.4574127197266, 97.2555160522461,
144.035659790039), agbh = c(0.505127429962158, 3.07101631164551,
3.57691407203674, 2.53673243522644, 0.00112199422437698, 5.68933582305908,
0.250360578298569, 3.60711622238159, 4.05604648590088, 4.6323823928833
), nir = c(0.229171246290207, 0.194354802370071, 0.17390401661396,
0.229749098420143, 0.272861808538437, 0.181734174489975, 0.257997244596481,
0.195779532194138, 0.195880636572838, 0.191920325160027), ebbi = c(-0.0958575084805489,
-0.0110349524766207, 0.00237648747861385, -0.0257017724215984,
-0.0904427915811539, 0.0401351079344749, -0.145487412810326,
-0.00527992565184832, 0.013465441763401, 0.030767060816288),
ndbi = c(-0.133083552122116, -0.0179156363010406, 0.00375741114839911,
-0.0328953787684441, -0.115226343274117, 0.0590021200478077,
-0.184676945209503, -0.00770473340526223, 0.0193305220454931,
0.045164693146944), road = c(251.6875, 83.7782592773438,
487.718627929688, 125.576850891113, 14.2957067489624, 540.215454101562,
6.76938247680664, 463.005859375, 326.509704589844, 399.547058105469
), pan = c(0.114391185343266, 0.131493717432022, 0.133256837725639,
0.135219603776932, 0.131857812404633, 0.150442510843277,
0.11356084048748, 0.136978164315224, 0.143271997570992, 0.150269523262978
), nbai = c(-0.140348717570305, -0.0788049399852753, -0.0648676380515099,
-0.098626047372818, -0.122976809740067, -0.0357712768018246,
-0.161113634705544, -0.0757625624537468, -0.0654538571834564,
-0.0543252900242805), tirs = c(34.5458335876465, 35.3345603942871,
36.5489959716797, 36.8556747436523, 35.1384429931641, 36.5604209899902,
33.3936080932617, 36.6631050109863, 37.0490875244141, 37.2179985046387
), geometry = structure(list(structure(c(446230.3092, 3490716.9101
), class = c("XY", "POINT", "sfg")), structure(c(448750.3092,
3493656.9101), class = c("XY", "POINT", "sfg")), structure(c(434470.3092,
3479796.9101), class = c("XY", "POINT", "sfg")), structure(c(440770.3092,
3491136.9101), class = c("XY", "POINT", "sfg")), structure(c(436570.3092,
3470556.9101), class = c("XY", "POINT", "sfg")), structure(c(439510.3092,
3478536.9101), class = c("XY", "POINT", "sfg")), structure(c(426490.3092,
3482736.9101), class = c("XY", "POINT", "sfg")), structure(c(429430.3092,
3477276.9101), class = c("XY", "POINT", "sfg")), structure(c(434890.3092,
3479376.9101), class = c("XY", "POINT", "sfg")), structure(c(430270.3092,
3481476.9101), class = c("XY", "POINT", "sfg"))), class = c("sfc_POINT",
"sfc"), precision = 0, bbox = structure(c(xmin = 426490.3092,
ymin = 3470556.9101, xmax = 448750.3092, ymax = 3493656.9101
), class = "bbox"), crs = structure(list(input = "EPSG:24313",
wkt = "PROJCRS[\"Kalianpur 1962 / UTM zone 43N\",\n BASEGEOGCRS[\"Kalianpur 1962\",\n DATUM[\"Kalianpur 1962\",\n ELLIPSOID[\"Everest 1830 (1962 Definition)\",6377301.243,300.8017255,\n LENGTHUNIT[\"metre\",1]]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n ID[\"EPSG\",4145]],\n CONVERSION[\"UTM zone 43N\",\n METHOD[\"Transverse Mercator\",\n ID[\"EPSG\",9807]],\n PARAMETER[\"Latitude of natural origin\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433],\n ID[\"EPSG\",8801]],\n PARAMETER[\"Longitude of natural origin\",75,\n ANGLEUNIT[\"degree\",0.0174532925199433],\n ID[\"EPSG\",8802]],\n PARAMETER[\"Scale factor at natural origin\",0.9996,\n SCALEUNIT[\"unity\",1],\n ID[\"EPSG\",8805]],\n PARAMETER[\"False easting\",500000,\n LENGTHUNIT[\"metre\",1],\n ID[\"EPSG\",8806]],\n PARAMETER[\"False northing\",0,\n LENGTHUNIT[\"metre\",1],\n ID[\"EPSG\",8807]]],\n CS[Cartesian,2],\n AXIS[\"(E)\",east,\n ORDER[1],\n LENGTHUNIT[\"metre\",1]],\n AXIS[\"(N)\",north,\n ORDER[2],\n LENGTHUNIT[\"metre\",1]],\n USAGE[\n SCOPE[\"Engineering survey, topographic mapping.\"],\n AREA[\"Pakistan - east of 72°E.\"],\n BBOX[28.21,72,37.07,77.83]],\n ID[\"EPSG\",24313]]"), class = "crs"), n_empty = 0L)), row.names = c(NA,
10L), sf_column = "geometry", agr = structure(c(ntl = NA_integer_,
pop = NA_integer_, agbh = NA_integer_, nir = NA_integer_, ebbi = NA_integer_,
ndbi = NA_integer_, road = NA_integer_, pan = NA_integer_, nbai = NA_integer_,
tirs = NA_integer_), class = "factor", levels = c("constant",
"aggregate", "identity")), class = c("sf", "data.frame"))
So the core issue here is in how you've got that workflow()
call set up. You might find https://www.tidymodels.org/start/tuning/ useful for how to work with this function and package.
The first issue is that rand_forest()
's first argument is mode
, which takes a character string specifying whether you're doing regression or classification. It doesn't take a recipe, and so piping your recipe into the random forest errors:
library(tidymodels)
recipe <- recipes::recipe(Sale_Price ~ ., modeldata::ames)
recipe %>%
rand_forest()
#> Warning in model_env_matches$mode == mode: longer object length is not a
#> multiple of shorter object length
#> Error in `vctrs::vec_slice()`:
#> ! Can't subset elements with `i`.
#> ✖ Logical subscript `i` must be size 1 or 6, not 7.
#> Backtrace:
#> ▆
#> 1. ├─recipe %>% rand_forest()
#> 2. ├─parsnip::rand_forest(.)
#> 3. │ └─parsnip::new_model_spec(...)
#> 4. │ └─parsnip::spec_is_possible(spec = out)
#> 5. │ └─vctrs::vec_slice(...)
#> 6. └─vctrs (local) `<fn>`()
#> 7. └─vctrs:::stop_indicator_size(...)
#> 8. └─rlang::cnd_signal(...)
Created on 2023-10-04 with reprex v2.0.2
The next few pipes are fine -- you can set up a model specification using rand_forest() %>% set_mode() %>% set_engine()
without an error:
library(tidymodels)
rand_forest() |>
set_mode("regression") |>
set_engine("randomForest")
#> Random Forest Model Specification (regression)
#>
#> Computational engine: randomForest
Created on 2023-10-04 with reprex v2.0.2
But then the arguments to fit_resamples()
depend upon if you want to use a workflow or a normal model specification. Check ?fit_resamples
for more information, but basically -- when you pipe the model spec into fit_resamples()
, the second argument needs to be your recipe. If you want to use a workflow, you'll set the recipe earlier instead.
So, all that said, here's 2 equivalent methods for how you can use fit_resamples()
here:
library(tidymodels)
recipe <- recipes::recipe(Sale_Price ~ ., data = modeldata::ames)
folds <- rsample::vfold_cv(modeldata::ames, v = 2)
rf_spec <- rand_forest() |>
set_mode("regression") |>
set_engine("ranger") # just to be faster
# These are ~equivalent:
rf_spec |>
tune::fit_resamples(recipe, folds)
#> # Resampling results
#> # 2-fold cross-validation
#> # A tibble: 2 × 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [1465/1465]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
#> 2 <split [1465/1465]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
workflow(recipe, rf_spec) |>
tune::fit_resamples(folds)
#> # Resampling results
#> # 2-fold cross-validation
#> # A tibble: 2 × 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [1465/1465]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
#> 2 <split [1465/1465]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
Created on 2023-10-04 with reprex v2.0.2
And I've tested and it works with your data (after using sf::st_drop_geometry()
when making the recipe -- sorry, this is a known unfriendly behavior in recipes & tune) as well 😄
Thank you for the detailed explanation.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.