
`CppMethod` error when applying prepped UMAP recipe after saving/reading as `.rds`

Seems like there is a bug 🐛 in step_umap() when saving a prepped recipe as .rds and reading it back to apply it to new data.

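library(recipes)  # recipe(), step_center(), step_scale(), prep(), bake()
library(embed)    # step_umap()
library(readr)    # write_rds(), read_rds()

split <- seq.int(1, 150, by = 9)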
tr <- iris[-split, ]
te <- iris[ split, ]

supervised <- 
   recipe(Species ~ ., data = tr) %>%
   step_center(all_predictors()) %>% 
   step_scale(all_predictors()) %>% 
   step_umap(all_predictors(), outcome = vars(Species), num_comp = 2) %>% 
   prep(training = tr)

write_rds(supervised, here::here(tempdir(), "umap.rds"))
saved_rec <- read_rds(here::here(tempdir(), "umap.rds"))
saved_rec %>% bake(new_data = te)
#> Error in .External(structure(list(name = "CppMethod__invoke_notvoid", : NULL value passed as symbol address

I'm sure this is not us (i.e. not the embed package) but I wonder if there is anything we can do about this.

The recipe is fine if you don't save as .rds and then read it back.

I am very late to discovering this, but yes, this is almost certainly because of the underlying UMAP package (uwot), which uses RcppAnnoy, which itself wraps the C++ library Annoy to find approximate nearest neighbors. The RcppAnnoy objects have their own save and load methods that must be called; just using saveRDS with them won't work (at least I couldn't get it to work). In turn, uwot needs to provide special functions to save and load its state, but it's all very unsatisfactory. Sorry about that. I was unable to think of a workaround.
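For reference, here is a minimal sketch of what the uwot-level save/load path looks like, assuming the `save_uwot()` and `load_uwot()` functions exported by uwot. It works on a bare uwot model rather than a prepped recipe, so it illustrates the mechanism and is not a fix for the recipe in this issue. The file name and the use of only the numeric iris columns are choices made for illustration.

```r
library(uwot)

set.seed(42)
split <- seq.int(1, 150, by = 9)
tr <- iris[-split, 1:4]  # numeric predictors only (illustrative choice)
te <- iris[ split, 1:4]

# ret_model = TRUE keeps the nearest neighbor index needed to embed new data
model <- umap(tr, n_components = 2, ret_model = TRUE)

# save_uwot() writes the model together with its Annoy index;
# plain saveRDS() would lose the index held on the C++ side
model_file <- tempfile("uwot-model")
save_uwot(model, file = model_file)

# load_uwot() rebuilds the index so the model can transform new rows
restored <- load_uwot(model_file)
new_coords <- umap_transform(te, restored)
dim(new_coords)
```

The restored model should return one row of coordinates per row of `te`, with one column per UMAP component.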

I do intend to fix this, but my current solution involves writing an entirely new approximate nearest neighbors package. As both that and maintaining uwot are entirely spare-time endeavors, it's taking quite a long time (3 years and counting for the nearest neighbor package). I'll get there in the end. Probably.

Thanks for the message @jlmelville and for your work on uwot! 🙌 We are also thinking about serialization for trained model objects like xgboost, torch, etc., that have native methods for saving/loading. Definitely an area that needs some attention from all of us!

This has now been solved with the new bundle package:


library(bundle)  # bundle(), unbundle()

temp_file <- fs::file_temp(pattern = "umap", ext = "rds")
bundle(supervised) %>% write_rds(temp_file)

saved_rec <- read_rds(temp_file)
unbundle(saved_rec) %>% bake(new_data = te)
#> # A tibble: 17 × 3
#>    Species     UMAP1  UMAP2
#>    <fct>       <dbl>  <dbl>
#>  1 setosa      13.3    2.93
#>  2 setosa      12.0    4.69
#>  3 setosa      14.5    3.12
#>  4 setosa      13.5    3.07
#>  5 setosa      13.4    2.99
#>  6 setosa      12.0    4.86
#>  7 versicolor -10.1    8.80
#>  8 versicolor  -9.79   8.28
#>  9 versicolor  -4.91 -11.6 
#> 10 versicolor  -9.66   6.12
#> 11 versicolor -10.1    6.61
#> 12 versicolor -10.3    6.98
#> 13 virginica   -4.14 -11.6 
#> 14 virginica   -2.69 -12.1 
#> 15 virginica   -4.06 -10.3 
#> 16 virginica   -1.73 -11.5 
#> 17 virginica   -2.33 -10.9

We should document somewhere that this step needs to be bundled for use in a new session. How do you all want to do that?

Looks like I need to get in on this bundle thing...

I think we should document it as a section, like we do with Tidying and Case weights. That way it will be easier to link to the documentation when the question pops up.

Agreed. We just did this for the parsnip engine docs.

