DavisVaughan/furrr

future_walk works in parallel in a script function, but not in an R package function

Closed this issue · 13 comments

Dear developer,
When I write an R script function that uses future_walk, it runs in parallel, but if I put the same function in an R package, it runs sequentially.

future::availableCores()
system
160
################## R script function, it works fine.
mutect2 <- function(config, interval_dir){
  intervals <- dir_ls(interval_dir, glob = "*-scattered.interval_list")
  oplan <- plan(multisession, workers = 60)
  on.exit(plan(oplan), add = TRUE)
  future_walk(intervals, ~ mutect2_wes_one(config, .x))
}

Running mutect2(config, interval_dir) works fine.

################## R package (mypkg) function, called from outside the package as mypkg::mutect2. It does not work as expected.

future::availableCores()
system
160
mutect2 <- function(config, interval_dir){
  intervals <- dir_ls(interval_dir, glob = "*-scattered.interval_list")
  oplan <- plan(multisession, workers = 60)
  on.exit(plan(oplan), add = TRUE)
  future_walk(intervals, ~ mutect2_wes_one(config, .x))
}

Running mypkg::mutect2(config, interval_dir) does not work as expected.

packageVersion("furrr")
[1] '0.3.0.9000'

packageVersion("future")
[1] '1.25.0'

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 8

Matrix products: default
BLAS/LAPACK: /cluster/apps/anaconda3/2020.02/envs/R-4.1.1/lib/libopenblasp-r0.3.17.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] furrr_0.3.0.9000 future_1.25.0 jhtools_1.0.0
[4] glue_1.6.2 jhuanglabwgs_1.0.0 optparse_1.7.1
[7] configr_0.3.5 futile.logger_1.4.3 pak_0.3.0
[10] devtools_2.4.3 usethis_2.1.5 rvcheck_0.2.1
[13] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9
[16] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
[19] tibble_3.1.7 ggplot2_3.3.6 tidyverse_1.3.1
[22] fs_1.5.2 wget_0.0.1

loaded via a namespace (and not attached):
[1] utf8_1.2.2 tidyselect_1.1.2
[3] htmlwidgets_1.5.4 RSQLite_2.2.14
[5] AnnotationDbi_1.54.1 grid_4.1.1
[7] BiocParallel_1.28.3 munsell_0.5.0
[9] codetools_0.2-18 withr_2.5.0
[11] colorspace_2.0-3 Biobase_2.54.0
[13] filelock_1.0.2 ggfortify_0.4.14
[15] knitr_1.39 rstudioapi_0.13
[17] stats4_4.1.1 ggsignif_0.6.3
[19] listenv_0.8.0 MatrixGenerics_1.6.0
[21] tximport_1.20.0 GenomeInfoDbData_1.2.7
[23] ini_0.3.1 bit64_4.0.5
[25] rprojroot_2.0.3 parallelly_1.31.1
[27] vctrs_0.4.1 generics_0.1.2
[29] xfun_0.30 lambda.r_1.2.4
[31] biovizBase_1.40.0 BiocFileCache_2.2.1
[33] regioneR_1.24.0 R6_2.5.1
[35] GenomeInfoDb_1.30.1 AnnotationFilter_1.16.0
[37] bitops_1.0-7 cachem_1.0.6
[39] DelayedArray_0.20.0 assertthat_0.2.1
[41] BiocIO_1.2.0 scales_1.2.0
[43] nnet_7.3-17 gtable_0.3.0
[45] globals_0.15.0 processx_3.5.3
[47] ensembldb_2.16.4 rlang_1.0.2
[49] splines_4.1.1 lazyeval_0.2.2
[51] rtracklayer_1.52.1 rstatix_0.7.0
[53] dichromat_2.0-0.1 checkmate_2.1.0
[55] broom_0.8.0 BiocManager_1.30.17
[57] yaml_2.3.5 abind_1.4-5
[59] modelr_0.1.8 GenomicFeatures_1.44.2
[61] backports_1.4.1 Hmisc_4.7-0
[63] tools_4.1.1 ellipsis_0.3.2
[65] gplots_3.1.3 RColorBrewer_1.1-3
[67] karyoploteR_1.18.0 DNAcopy_1.66.0
[69] BiocGenerics_0.40.0 sessioninfo_1.2.2
[71] Rcpp_1.0.8.3 base64enc_0.1-3
[73] progress_1.2.2 zlibbioc_1.40.0
[75] RCurl_1.98-1.6 ps_1.7.0
[77] prettyunits_1.1.1 rpart_4.1.16
[79] ggpubr_0.4.0 RcppTOML_0.1.7
[81] S4Vectors_0.32.4 cluster_2.1.3
[83] SummarizedExperiment_1.24.0 haven_2.5.0
[85] magrittr_2.0.3 data.table_1.14.2
[87] futile.options_1.0.1 openxlsx_4.2.5
[89] reprex_2.0.1 ProtGenerics_1.24.0
[91] matrixStats_0.62.0 pkgload_1.2.4
[93] hms_1.1.1 patchwork_1.1.1
[95] XML_3.99-0.9 jpeg_0.1-9
[97] readxl_1.4.0 IRanges_2.28.0
[99] gridExtra_2.3 testthat_3.1.4
[101] compiler_4.1.1 biomaRt_2.48.3
[103] KernSmooth_2.23-20 crayon_1.5.1
[105] htmltools_0.5.2 tzdb_0.3.0
[107] Formula_1.2-4 lubridate_1.8.0
[109] DBI_1.1.2 formatR_1.12
[111] corrplot_0.92 dbplyr_2.1.1
[113] rappdirs_0.3.3 Matrix_1.4-1
[115] getopt_1.20.3 car_3.0-13
[117] brio_1.1.3 cli_3.3.0
[119] gdata_2.18.0 parallel_4.1.1
[121] GenomicRanges_1.46.1 pkgconfig_2.0.3
[123] GenomicAlignments_1.28.0 foreign_0.8-82
[125] xml2_1.3.3 XVector_0.34.0
[127] rvest_1.0.2 yulab.utils_0.0.4
[129] bezier_1.1.2 VariantAnnotation_1.38.0
[131] callr_3.7.0 digest_0.6.29
[133] Biostrings_2.60.2 cellranger_1.1.0
[135] htmlTable_2.4.0 restfulr_0.0.13
[137] curl_4.3.2 Rsamtools_2.8.0
[139] gtools_3.9.2 rjson_0.2.21
[141] lifecycle_1.0.1 jsonlite_1.8.0
[143] carData_3.0-5 desc_1.4.1
[145] limma_3.50.3 BSgenome_1.60.0
[147] fansi_1.0.3 pillar_1.7.0
[149] lattice_0.20-45 survival_3.3-1
[151] KEGGREST_1.32.0 fastmap_1.1.0
[153] httr_1.4.3 pkgbuild_1.3.1
[155] remotes_2.4.2 conflicted_1.1.0
[157] zip_2.2.0 bamsignals_1.24.0
[159] png_0.1-7 bit_4.0.4
[161] stringi_1.7.6 blob_1.2.3
[163] org.Hs.eg.db_3.13.0 latticeExtra_0.6-29
[165] caTools_1.18.2 memoise_2.0.1

It does not work as expected

You haven't explained what the actual problem is. Can you please provide some output for the failing case?

It does not fail. It just does not work as expected. When I call the R script function, it uses 60 workers in parallel. When I call the package function, mypkg::mutect2(config, interval_dir), it uses only two workers and effectively runs sequentially. I can reproduce this reliably. I have tried .env_globals = rlang::global_env() and .env_globals = parent.frame(). Neither helps.

future_walk(intervals, ~ mutect2_wes_one(config, .x), .env_globals = rlang::global_env())

I suspect that the function's calling environment is causing this problem.

So the problem is that it is running sequentially when called through the package, even though you set plan(multisession) in the package function? But if you don't put it in a package then it correctly runs in parallel?

That sounds strange to me.

It is unlikely to be a function environment issue if that is the case.

Can you point me to a repo on GitHub that has this package in it? Or can you create a repo on GitHub that demonstrates this problem for you? I am unlikely to be able to help you otherwise
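One way to produce concrete output for a report like this is to record which R process handled each element. The following is a minimal sketch, not something taken from the thread: worker_pids() is a hypothetical helper, and 2 workers are used only to keep the example small. Under multisession, none of the returned process IDs should match the main session's; if they all match it, the work ran in the calling process.

```r
library(future)
library(furrr)

# Hypothetical helper (not part of furrr): returns the ID of the R process
# that evaluated each element.
worker_pids <- function(n) {
  future_map_int(seq_len(n), ~ Sys.getpid())
}

# Assumption for illustration: a small plan with 2 background sessions.
oplan <- plan(multisession, workers = 2)
pids <- worker_pids(4)
n_workers <- nbrOfWorkers()
plan(oplan)  # restore the previous plan

main_pid <- Sys.getpid()
cat("workers reported by the plan:", n_workers, "\n")
cat("distinct worker PIDs observed:", length(unique(pids)), "\n")
cat("any element ran in the main session:", any(pids == main_pid), "\n")
```

Running the same helper once from the script and once from inside the package function would show whether the two calls really see different plans.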

By the way, setting plan() inside a function is typically not best practice. plan() should really only be called at the user level. Users should control whether or not the function runs in parallel, and the default should be to run sequentially.
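A minimal sketch of that pattern, under stated assumptions: process_intervals() is a hypothetical stand-in for a package function, and sqrt() stands in for the real per-interval work. The function never calls plan(); it runs under whatever plan the caller has set.

```r
library(future)
library(furrr)

# Hypothetical package-style function: it contains no plan() call, so the
# caller decides how the work is executed.
process_intervals <- function(intervals) {
  future_map_dbl(intervals, ~ sqrt(.x))
}

# Runs under the caller's current plan (sequential unless configured otherwise).
res_default <- process_intervals(1:4)

# The user opts in to parallel execution, then restores the previous plan.
oplan <- plan(multisession, workers = 2)
res_parallel <- process_intervals(1:4)
plan(oplan)

# The results should not depend on the plan.
print(identical(res_default, res_parallel))
```

Keeping plan() at the user level also means that the number of workers can be tuned per machine without editing the package.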

That is right. The problem is that it runs sequentially when called through the package, even though I set plan(multisession) in the package function. If I don't put it in a package, it correctly runs in parallel.
I will try to upload the package to GitHub. It would be best if you have gatk installed.

If I call plan() outside the R package, it still does not run in parallel.

I have made an R package at:

https://github.com/jinyancool/fakepkg

The function is:

test_furrr <- function(){
  intervals <- seq(1,60)
  oplan <- plan(multisession, workers = 60)
  on.exit(plan(oplan), add = TRUE)
  future_walk(intervals, ~ run_fun(.x))
}

You will find that running fakepkg::test_furrr() and pasting the test_furrr() definition into the terminal and running it directly give quite different timings.

library(tictoc)
tic()
test_furrr()
toc()
25.259 sec elapsed

tic()
fakepkg::test_furrr()
toc()
157.118 sec elapsed

Can we solve this issue now? Thanks.

Closing due to inability to reproduce