Question regarding recipes
nipnipj opened this issue · 8 comments
Hello!
How can we correctly use recipes with spatial data? I'm getting the following error: "The number of roles should be the same as the number of variables", with
library(sf)
library(dplyr)
data("ames", package = "modeldata")
data_raw <- st_as_sf(ames, coords = c("Longitude", "Latitude")) %>%
  mutate(Sale_Price = log(Sale_Price))
Hi @nipnipj ! Can you please provide a reprex that shows the error you're getting? The code you provided should run perfectly fine, and without seeing what code you're running to trigger that error I'm not able to guess what's going on here.
data("ames", package = "modeldata")
(data_raw <- sf::st_as_sf(ames, coords = c("Longitude", "Latitude")) |>
  dplyr::mutate(Sale_Price = log(Sale_Price)))
#> Simple feature collection with 2930 features and 72 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -93.69315 ymin: 41.9865 xmax: -93.57743 ymax: 42.06339
#> CRS: NA
#> # A tibble: 2,930 × 73
#> MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
#> * <fct> <fct> <dbl> <int> <fct> <fct> <fct>
#> 1 One_Story_1946_and_Ne… Resident… 141 31770 Pave No_A… Slightly…
#> 2 One_Story_1946_and_Ne… Resident… 80 11622 Pave No_A… Regular
#> 3 One_Story_1946_and_Ne… Resident… 81 14267 Pave No_A… Slightly…
#> 4 One_Story_1946_and_Ne… Resident… 93 11160 Pave No_A… Regular
#> 5 Two_Story_1946_and_Ne… Resident… 74 13830 Pave No_A… Slightly…
#> 6 Two_Story_1946_and_Ne… Resident… 78 9978 Pave No_A… Slightly…
#> 7 One_Story_PUD_1946_an… Resident… 41 4920 Pave No_A… Regular
#> 8 One_Story_PUD_1946_an… Resident… 43 5005 Pave No_A… Slightly…
#> 9 One_Story_PUD_1946_an… Resident… 39 5389 Pave No_A… Slightly…
#> 10 Two_Story_1946_and_Ne… Resident… 60 7500 Pave No_A… Regular
#> # ℹ 2,920 more rows
#> # ℹ 66 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
#> # Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
#> # Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
#> # Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#> # Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#> # Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>, …
Created on 2023-05-03 with reprex v2.0.2
If this is really a {recipes} question, then we are aware that it doesn't work. We have ideas of how to make it work but it isn't scheduled to happen in the near or medium future.
library(recipes)
library(sf)
data("ames", package = "modeldata")
data_raw <- st_as_sf(ames, coords = c("Longitude", "Latitude")) %>%
  mutate(Sale_Price = log(Sale_Price))
recipe(~., data = data_raw)
#> Error in model.frame.default(formula, data[1, ]): invalid type (list) for variable 'geometry'
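To illustrate why the geometry column gets in the way, here is a minimal sketch on a made-up data set (the `toy` data frame and its column names are hypothetical, not from this thread). It shows that the geometry column of an sf object is a list-column, that it stays attached when you subset columns, and that `sf::st_drop_geometry()` removes it:

```r
library(sf)

# Hypothetical toy data standing in for ames
toy <- data.frame(
  price = c(100, 150, 120, 180),
  year  = c(1990, 2005, 1978, 2010),
  lon   = c(-93.62, -93.63, -93.61, -93.64),
  lat   = c(42.05, 42.06, 42.03, 42.04)
)
toy_sf <- st_as_sf(toy, coords = c("lon", "lat"), crs = 4326)

# The geometry column is an sfc object, which is a list under the hood;
# this is the type that model.frame() rejects.
is.list(toy_sf$geometry)
inherits(toy_sf$geometry, "sfc")

# Geometry is "sticky": selecting two columns still returns three.
names(toy_sf[, c("price", "year")])

# Dropping the geometry leaves only the ordinary columns.
names(st_drop_geometry(toy_sf))
```

The sticky geometry column means a column subset of an sf object can carry one more variable than was asked for, which is worth keeping in mind when a function compares the number of selected variables against the columns it receives.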
Yes, I forgot to add
rec <- data_raw %>%
  recipe(Sale_Price ~ Year_Built + Gr_Liv_Area + Bldg_Type)
I see thank you both for answering!
Now here's a question for @EmilHvitfeldt (thanks for stepping in 😄 ) -- any reason to expect the below to error or cause problems? Specifically, dropping the spatial information for the recipe specification, but fitting to resamples from spatialsample?
data("ames", package = "modeldata")
ames_sf <- sf::st_as_sf(ames, coords = c("Longitude", "Latitude"), crs = 4326)
# Drop the spatial information for the recipe:
recipe <- recipes::recipe(Sale_Price ~ Year_Built, data = sf::st_drop_geometry(ames_sf)) |>
  recipes::step_log(recipes::all_outcomes())
workflows::workflow(recipe, parsnip::linear_reg()) |>
  # but keep it when assigning resamples
  tune::fit_resamples(spatialsample::spatial_clustering_cv(ames_sf))
#> # Resampling results
#> # 10-fold spatial cross-validation
#> # A tibble: 10 × 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [2559/371]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]>
#> 2 <split [2740/190]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]>
#> 3 <split [2685/245]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]>
#> 4 <split [2777/153]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]>
#> 5 <split [2656/274]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]>
#> 6 <split [2668/262]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]>
#> 7 <split [2496/434]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]>
#> 8 <split [2570/360]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]>
#> 9 <split [2709/221]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]>
#> 10 <split [2510/420]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]>
Created on 2023-05-03 with reprex v2.0.2
It might break in the future, before it gets official support 😬 We are being bitten by non-tibble tibbles, so we are starting to force data.frames to be bare data.frames internally in some places while we wait for potential future native sf support.
See r-spatial/sf#2131 for reference for some of the struggles
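As a rough illustration of what a "non-tibble tibble" looks like, the sketch below uses hypothetical toy data and is not a description of how tidymodels handles this internally. It shows an sf object that inherits from `tbl_df` while subsetting differently from a plain tibble, and one possible way to reduce it to a bare data.frame:

```r
library(sf)
library(tibble)

# Hypothetical toy data
toy <- tibble(
  price = c(100, 150, 120),
  lon   = c(-93.62, -93.63, -93.61),
  lat   = c(42.05, 42.06, 42.03)
)
toy_sf <- st_as_sf(toy, coords = c("lon", "lat"), crs = 4326)

# The sf object still reports itself as a tibble...
inherits(toy_sf, "tbl_df")

# ...but column subsetting behaves differently: geometry tags along.
ncol(toy[, "price"])     # 1 column for the plain tibble
ncol(toy_sf[, "price"])  # 2 columns for the sf object

# One way to get a bare data.frame: drop geometry, then strip the tibble class.
bare <- as.data.frame(st_drop_geometry(toy_sf))
class(bare)
```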
I think that wouldn't cause any problems (and makes a lot of sense for tidymodels to do, if you're accepting inputs of any subclass and expecting them to not have any different methods or behaviors from tibbles). To be clear, that sf::st_drop_geometry() in the recipe() call is already casting the sf object to a tibble:
data("ames", package = "modeldata")
ames_sf <- sf::st_as_sf(ames, coords = c("Longitude", "Latitude"), crs = 4326)
sf::st_drop_geometry(ames_sf)
#> # A tibble: 2,930 × 72
#> MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
#> * <fct> <fct> <dbl> <int> <fct> <fct> <fct>
#> 1 One_Story_1946_and_Ne… Resident… 141 31770 Pave No_A… Slightly…
#> 2 One_Story_1946_and_Ne… Resident… 80 11622 Pave No_A… Regular
#> 3 One_Story_1946_and_Ne… Resident… 81 14267 Pave No_A… Slightly…
#> 4 One_Story_1946_and_Ne… Resident… 93 11160 Pave No_A… Regular
#> 5 Two_Story_1946_and_Ne… Resident… 74 13830 Pave No_A… Slightly…
#> 6 Two_Story_1946_and_Ne… Resident… 78 9978 Pave No_A… Slightly…
#> 7 One_Story_PUD_1946_an… Resident… 41 4920 Pave No_A… Regular
#> 8 One_Story_PUD_1946_an… Resident… 43 5005 Pave No_A… Slightly…
#> 9 One_Story_PUD_1946_an… Resident… 39 5389 Pave No_A… Slightly…
#> 10 Two_Story_1946_and_Ne… Resident… 60 7500 Pave No_A… Regular
#> # ℹ 2,920 more rows
#> # ℹ 65 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
#> # Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
#> # Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
#> # Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#> # Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#> # Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>, …
Created on 2023-05-04 with reprex v2.0.2
The data in the resamples from spatial_clustering_cv() is still an sf object, but the recipe isn't looking for the geometry column, so casting that to a tibble should be fine. I think that means this should be decently future-proof. It does mean you can't directly include geometry columns as predictors or as recipe steps, but tidymodels doesn't support that anyway, so I think this workaround will work for most use-cases.
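If location is still wanted as a predictor, one possible approach (an assumption on my part, not something proposed in this thread) is to copy the point coordinates into ordinary numeric columns on the sf object. They then survive `sf::st_drop_geometry()` and are also present in any resamples built from the same object. A minimal sketch with hypothetical toy data and column names:

```r
library(sf)
library(recipes)

# Hypothetical toy data standing in for ames
toy <- data.frame(
  price = c(100, 150, 120, 180),
  year  = c(1990, 2005, 1978, 2010),
  lon   = c(-93.62, -93.63, -93.61, -93.64),
  lat   = c(42.05, 42.06, 42.03, 42.04)
)
toy_sf <- st_as_sf(toy, coords = c("lon", "lat"), crs = 4326)

# st_coordinates() returns a matrix with columns X and Y for point data.
coords <- st_coordinates(toy_sf)
toy_sf$lon <- unname(coords[, "X"])
toy_sf$lat <- unname(coords[, "Y"])

# The recipe only ever sees a plain table without the geometry column.
rec <- recipe(price ~ year + lon + lat, data = st_drop_geometry(toy_sf))
summary(rec)
```

This only makes sense for point geometries; polygons or lines would first need some summary such as a centroid.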
Going to go ahead and close this issue now, as it sounds like there might be better venues (https://github.com/tidymodels/planning/, maybe?) for discussions of how and if tidymodels wants to support spatial data moving forward. I believe that the workaround I shared will be pretty robust going forward: only spatialsample functions actually need an sf object, so fit_resamples() or recipe() casting to data frames shouldn't matter. Future readers of this thread should be aware, though, that the situation may have changed!