Load .RDS files directly into environment with `gcs_get_object`?
samuel-marsh opened this issue · 15 comments
Hi,
This might be a naive question and I might be missing something, but I'm wondering if there is a way to load a file saved as .RDS from a GCP bucket directly into the local R environment without saving it to disk first?
I have been trying this with objects created with the single-cell analysis package Seurat, which creates S4 class objects (for more on the Seurat object format, see https://github.com/mojaveazure/seurat-object and https://github.com/satijalab/seurat/wiki).
When I run:
obj <- gcs_get_object(object_name = "gs://bucket_name/obj.RDS")
It loads into the environment as a "Raw" file that is then unreadable by Seurat. If I add saveToDisk = "obj.RDS" and then read the file into R with readRDS (or the wrapper read_rds), it works just fine and is readable by Seurat.
I'm wondering whether there is an additional parameter I'm missing that would allow this or, if not, whether this is a feature that could be added?
Thanks!
Sam
Yes, you can supply a custom parse function to load the object directly into R. You would want something like readRDS().
All the downloads write to disk at least temporarily, so it's not more efficient, but it is a lot more convenient :)
Hi Mark,
Thanks for the quick response. This must be what I'm not quite understanding, because when I run:
obj <- gcs_get_object(object_name = "gs://bucket_name/obj.RDS", parseFunction = readRDS())
I get an error that the parsing failed.
Thanks!
Sam
Sorry I thought this would be simpler but actually the raw RDS response is harder to deal with than I thought. The best I can come up with is a wrapper to saveToDisk then load it which will do what I thought it should do:
my_parse <- function(obj){
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp))
  suppressMessages(gcs_get_object(obj, saveToDisk = tmp))
  readRDS(tmp)
}
obj <- my_parse("gs://bucket_name/obj.RDS")
I will look at whether this can be improved :)
Rich Fergie found the right functions for parsing RDS without needing to save to disk for you: https://twitter.com/RichardFergie/status/1385531335423447040
f <- function(obj) {
  readRDS(gzcon(rawConnection(httr::content(obj))))
}
gcs_get_object("obj.rds", parseFunction = f)
I added the function as a helper as it looked useful, so for the GitHub version you can use:
gcs_get_object("obj.rds", parseFunction = gcs_parse_rds)
See ?gcs_parse_rds
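For anyone who wants to see why that one-liner works without involving a bucket at all, here is a minimal local sketch. It assumes the bytes returned by httr::content() are the same bytes that saveRDS() wrote, and it reads them back from a temp file with readBin() to stand in for the download. The name parse_rds_bytes is made up for this illustration and is not part of googleCloudStorageR.

```r
# Hypothetical helper: parse an in-memory raw vector holding RDS bytes.
parse_rds_bytes <- function(bytes) {
  # rawConnection() exposes the raw vector as a readable connection;
  # gzcon() layers gzip decompression on top of it.
  con <- gzcon(rawConnection(bytes))
  on.exit(close(con))
  readRDS(con)
}

# Stand-in for a downloaded object: write an RDS file, then slurp its bytes.
original <- list(ids = 1:3, label = "example")
tmp <- tempfile(fileext = ".rds")
saveRDS(original, tmp)
rds_bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

restored <- parse_rds_bytes(rds_bytes)
```

As far as I understand, closing the gzcon() connection also closes the raw connection it wraps, so the on.exit() call is there to avoid leaving connections open after parsing.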
Hey Mark,
Really appreciate your help on this! Unfortunately I'm still getting errors when I try it myself, although the errors differ depending on whether I use the GitHub branch or the CRAN version.
Using the GitHub master branch and running the code below results in the following error:
test <- gcs_get_object(object_name = "gs://bucket_name/exp17.RDS", parseFunction = gcs_parse_rds)
i Downloading exp17_micro.RDS
Error: Problem parsing the object with supplied parseFunction.
x Downloading exp17_micro.RDS ... failed
If I revert to the CRAN version and use the custom parse function from the global environment, I get the following error messages:
f <- function(obj) {
  readRDS(gzcon(rawConnection(httr::content(obj))))
}
test <- gcs_get_object(object_name = "gs://bucket_name/exp17.RDS", parseFunction = f)
Downloaded exp17_micro.RDS
Error in readRDS(gzcon(rawConnection(httr::content(obj)))) :
too large a block specified
Error in gcs_get_object(object_name = "gs://stevens_data_marsh/exp17_micro.RDS", :
Problem parsing the object with supplied parseFunction.
For reference the RDS object that I'm testing this with is 2.4GB.
Also including sessionInfo below for reference in case it's helpful!
Thanks again so much for all your help on this and quick response!!
Sam
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.3
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] beepr_1.3 Seurat_3.2.3
[3] forcats_0.5.0 stringr_1.4.0
[5] dplyr_1.0.5 purrr_0.3.4
[7] readr_1.3.1 tidyr_1.1.0
[9] tibble_3.0.1 ggplot2_3.3.0
[11] tidyverse_1.3.0 googleCloudStorageR_0.6.0
loaded via a namespace (and not attached):
[1] Rtsne_0.15 colorspace_1.4-1 deldir_0.1-28
[4] ellipsis_0.3.1 ggridges_0.5.2 fs_1.4.1
[7] spatstat.data_1.4-3 rstudioapi_0.11 leiden_0.3.3
[10] listenv_0.8.0 remotes_2.1.1 audio_0.1-7
[13] ggrepel_0.8.2 lubridate_1.7.8 xml2_1.3.2
[16] codetools_0.2-16 splines_3.6.1 polyclip_1.10-0
[19] jsonlite_1.6.1 packrat_0.5.0 broom_0.5.6
[22] ica_1.0-2 cluster_2.1.0 dbplyr_1.4.3
[25] png_0.1-7 uwot_0.1.10 sctransform_0.3.1
[28] shiny_1.4.0.2 compiler_3.6.1 httr_1.4.1
[31] backports_1.1.7 lazyeval_0.2.2 assertthat_0.2.1
[34] Matrix_1.2-18 fastmap_1.0.1 gargle_1.1.0
[37] cli_2.4.0 later_1.0.0 htmltools_0.5.1.1
[40] tools_3.6.1 rsvd_1.0.3 igraph_1.2.5
[43] gtable_0.3.0 glue_1.4.1 reshape2_1.4.4
[46] RANN_2.6.1 rappdirs_0.3.1 spatstat_1.64-1
[49] Rcpp_1.0.6 scattermore_0.7 cellranger_1.1.0
[52] vctrs_0.3.6 nlme_3.1-148 lmtest_0.9-37
[55] globals_0.14.0 rvest_0.3.5 mime_0.9
[58] miniUI_0.1.1.1 lifecycle_1.0.0 irlba_2.3.3
[61] goftest_1.2-2 future_1.21.0 googleAuthR_1.3.1
[64] MASS_7.3-51.6 zoo_1.8-8 scales_1.1.1
[67] spatstat.utils_1.17-0 hms_0.5.3 promises_1.1.0
[70] parallel_3.6.1 RColorBrewer_1.1-2 yaml_2.2.1
[73] curl_4.3 gridExtra_2.3 memoise_1.1.0
[76] reticulate_1.15 pbapply_1.4-2 rpart_4.1-15
[79] stringi_1.4.6 zip_2.0.4 rlang_0.4.10
[82] pkgconfig_2.0.3 matrixStats_0.56.0 lattice_0.20-41
[85] tensor_1.5 ROCR_1.0-11 patchwork_1.0.0
[88] htmlwidgets_1.5.1 cowplot_1.0.0 tidyselect_1.1.0
[91] parallelly_1.21.0 RcppAnnoy_0.0.18 plyr_1.8.6
[94] magrittr_1.5 R6_2.4.1 generics_0.0.2
[97] DBI_1.1.0 mgcv_1.8-31 pillar_1.4.4
[100] haven_2.3.0 withr_2.2.0 fitdistrplus_1.1-1
[103] abind_1.4-5 survival_3.1-12 future.apply_1.5.0
[106] modelr_0.1.8 crayon_1.3.4 KernSmooth_2.23-17
[109] plotly_4.9.2.1 grid_3.6.1 readxl_1.3.1
[112] data.table_1.12.8 reprex_0.3.0 digest_0.6.25
[115] xtable_1.8-4 httpuv_1.5.2 openssl_1.4.1
[118] munsell_0.5.0 viridisLite_0.3.0 askpass_1.1
OK cool, it seems your RDS is a special case compared to mine ;) May I ask if the RDS files you are using are "old", in that they were created before R 3.5? The format type changed in that release, so I'm just trying to eliminate it as a cause.
Could you also run traceback() after your error to see which function is triggering it?
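One way to check the format question on a local copy of the file is base R's infoRDS(), which (to my knowledge) has been available since R 3.5.0 and reports the serialization version without loading the whole object. The sketch below writes two small files with explicit versions to show what the output distinguishes; the file names and the mtcars data are placeholders, not anything from this issue.

```r
# Write the same object with the older (2) and newer (3) serialization formats.
old_file <- tempfile(fileext = ".rds")
new_file <- tempfile(fileext = ".rds")
saveRDS(mtcars, old_file, version = 2)
saveRDS(mtcars, new_file, version = 3)

# infoRDS() reads only the header, so it is cheap even for large files.
old_version <- infoRDS(old_file)$version
new_version <- infoRDS(new_file)$version

unlink(c(old_file, new_file))
```

Running infoRDS() on the downloaded exp17 file would then show which format it was written with.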
And I guess writing to disk should work ok?
my_parse <- function(obj){
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp))
  suppressMessages(gcs_get_object(obj, saveToDisk = tmp))
  readRDS(tmp)
}
obj <- my_parse("gs://bucket_name/obj.RDS")
It may be that 2.4GB is just too big for R to decompress
FYI: for me, this works with a 10.2GB .RDS file that is saved without compression (with readr::write_rds). So the file size per se, at least, is not the issue. Thanks for implementing this very convenient parser function!
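A small local check of the point about compression, as a sketch only: the same gzcon(rawConnection(...)) route appears to handle both gzip-compressed and uncompressed RDS bytes, which matches the documented default allowNonCompressed = TRUE for gzcon(). It uses base saveRDS(compress = FALSE) as a stand-in for readr::write_rds, and read_rds_bytes is a hypothetical name used only here.

```r
# Hypothetical helper mirroring the parse function from this thread.
read_rds_bytes <- function(bytes) {
  con <- gzcon(rawConnection(bytes))
  on.exit(close(con))
  readRDS(con)
}

# Read a whole file into a raw vector, standing in for a downloaded object.
file_bytes <- function(path) readBin(path, what = "raw", n = file.size(path))

dat <- data.frame(x = 1:5, y = letters[1:5])
gz_file <- tempfile(fileext = ".rds")
plain_file <- tempfile(fileext = ".rds")
saveRDS(dat, gz_file)                      # default: gzip compression
saveRDS(dat, plain_file, compress = FALSE) # no compression

from_gz <- read_rds_bytes(file_bytes(gz_file))
from_plain <- read_rds_bytes(file_bytes(plain_file))

unlink(c(gz_file, plain_file))
```

This says nothing about files written with bzip2 or xz compression, which I have not tried through this route.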
Thanks @LukasWallrich, good to know. I think @samuel-marsh's rds file must then have something unique about it. If it is downloaded locally, trying to debug where readRDS(gzcon(rawConnection(httr::content(obj)))) goes wrong would be a start.
Sorry I thought this would be simpler but actually the raw RDS response is harder to deal with than I thought. The best I can come up with is a wrapper to saveToDisk then load it which will do what I thought it should do:
my_parse <- function(obj){
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp))
  suppressMessages(gcs_get_object(obj, saveToDisk = tmp))
  readRDS(tmp)
}
obj <- my_parse("gs://bucket_name/obj.RDS")
I will look at if this can be improved :)
Somewhat unrelated: this strategy also works for parsing UTF-16LE CSV files, which I haven't managed to do by just using read.csv(x, fileEncoding = "UTF-16LE") as the parseFunction.
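A rough sketch of what that looks like for a UTF-16LE CSV once the file is on disk. The file here is generated locally instead of downloaded, so the assumption carried over from the thread is that gcs_get_object(..., saveToDisk = tmp) leaves an unmodified copy of the object at tmp. It also assumes the platform's iconv supports UTF-16LE; the CSV contents are invented for the example.

```r
# Build a small UTF-16LE encoded CSV on disk to stand in for the download.
csv_text <- "id,name\n1,alpha\n2,beta\n"
utf16_bytes <- iconv(csv_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
tmp <- tempfile(fileext = ".csv")
writeBin(utf16_bytes, tmp)

# With a real file path, read.csv() can re-encode while reading.
parsed <- read.csv(tmp, fileEncoding = "UTF-16LE")
unlink(tmp)
```

In the bucket case, the writeBin() step would be replaced by the saveToDisk download inside a wrapper like my_parse.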
I forgot to mention here that gce_parse_rds() is now in the dev version via this commit: d912d0c
If there are other useful parsing functions I'd be glad to put them in.
@MarkEdmondson1234 - I think you might have meant to type gcs_parse_rds().
Thank you so much for your contributions! googleCloudStorageR and googleCloudRunner are incredibly useful tools.
Ah yes, that is it, gcs_ vs gce_ - it gets confusing sometimes working on the packages at the same time ;) Glad they are helpful!