RunPresto Error "object p_val not found" on large datasets
mihem opened this issue · 4 comments
RunPresto is great because it's super fast. When I tried to run it on a pretty large dataset (~73,000 cells), I got this error:
SeuratWrappers::RunPresto(my_dataset, ident.1 = "0")
Error in data.frame(p_val, row.names = rownames(x = data.use)) : object 'p_val' not found
I played around a little, and when I downsized the dataset to around 40 000 cells by subsetting, the above command runs without problems. I could also reproduce that on a different large dataset (around 80 000 cells). FindMarkers runs without problems by the way. That's a pity because RunPresto is especially useful in these large datasets.
Any idea what the problem could be?
@jaisonj708 maybe?
Thank you!
Edit:
To improve reproducibility, here is a minimal working example using the datasets of SeuratData.
library(Seurat)
library(SeuratData)
library(SeuratWrappers)
SeuratData::InstallData("hcabm40k")
SeuratData::InstallData("ifnb")
de1 <- SeuratWrappers::RunPresto(hcabm40k, ident.1 = "MantonBM1", ident.2 = NULL)
de2 <- SeuratWrappers::RunPresto(ifnb, ident.1 = "IMMUNE_CTRL", ident.2 = NULL)
large_dataset <- merge(hcabm40k, ifnb)
de3 <- SeuratWrappers::RunPresto(large_dataset, ident.1 = "IMMUNE_CTRL", ident.2 = NULL)
de1 and de2 give the expected results, but trying to compute de3 fails with:
Error in data.frame(p_val, row.names = rownames(x = data.use)) : object 'p_val' not found
session info
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux
Matrix products: default
BLAS: /usr/lib/libopenblasp-r0.3.13.so
LAPACK: /usr/lib/liblapack.so.3.9.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.5 purrr_0.3.4
[5] readr_1.4.0 tidyr_1.1.3 tibble_3.1.1 ggplot2_3.3.3.9000
[9] tidyverse_1.3.0 SeuratObject_4.0.0 Seurat_4.0.1
loaded via a namespace (and not attached):
[1] Rtsne_0.15 colorspace_2.0-0 deldir_0.2-10
[4] ellipsis_0.3.1 ggridges_0.5.3 fs_1.5.0
[7] rstudioapi_0.13 spatstat.data_2.1-0 leiden_0.3.7
[10] listenv_0.8.0 remotes_2.3.0 ggrepel_0.9.1
[13] fansi_0.4.2 lubridate_1.7.10 xml2_1.3.2
[16] codetools_0.2-18 splines_4.0.5 polyclip_1.10-0
[19] jsonlite_1.7.2 broom_0.7.6 ica_1.0-2
[22] cluster_2.1.1 dbplyr_2.1.1 png_0.1-7
[25] uwot_0.1.10 shiny_1.6.0 sctransform_0.3.2
[28] spatstat.sparse_2.0-0 BiocManager_1.30.12 compiler_4.0.5
[31] httr_1.4.2 SeuratWrappers_0.3.0 backports_1.2.1
[34] assertthat_0.2.1 Matrix_1.3-2 fastmap_1.1.0
[37] lazyeval_0.2.2 limma_3.46.0 cli_2.4.0
[40] later_1.1.0.1 htmltools_0.5.1.1 tools_4.0.5
[43] rsvd_1.0.5 igraph_1.2.6 gtable_0.3.0
[46] glue_1.4.2 RANN_2.6.1 reshape2_1.4.4
[49] Rcpp_1.0.6 scattermore_0.7 cellranger_1.1.0
[52] presto_1.0.0 vctrs_0.3.7 nlme_3.1-152
[55] lmtest_0.9-38 ps_1.6.0 globals_0.14.0
[58] rvest_1.0.0 mime_0.10 miniUI_0.1.1.1
[61] lifecycle_1.0.0 irlba_2.3.3 goftest_1.2-2
[64] future_1.21.0 MASS_7.3-53.1 zoo_1.8-9
[67] scales_1.1.1 spatstat.core_2.1-2 hms_1.0.0
[70] promises_1.2.0.1 spatstat.utils_2.1-0 parallel_4.0.5
[73] RColorBrewer_1.1-2 qs_0.24.1 reticulate_1.18
[76] pbapply_1.4-3 gridExtra_2.3 rpart_4.1-15
[79] stringi_1.5.3 rlang_0.4.10 pkgconfig_2.0.3
[82] matrixStats_0.58.0 lattice_0.20-41 ROCR_1.0-11
[85] tensor_1.5 patchwork_1.1.1 htmlwidgets_1.5.3
[88] cowplot_1.1.1 tidyselect_1.1.0 parallelly_1.24.0
[91] RcppAnnoy_0.0.18 plyr_1.8.6 magrittr_2.0.1
[94] R6_2.5.0 generics_0.1.0 DBI_1.1.1
[97] pillar_1.6.0 haven_2.4.0 withr_2.4.2
[100] mgcv_1.8-34 fitdistrplus_1.1-3 survival_3.2-10
[103] abind_1.4-5 future.apply_1.7.0 modelr_0.1.8
[106] crayon_1.4.1 KernSmooth_2.23-18 utf8_1.2.1
[109] RApiSerialize_0.1.0 spatstat.geom_2.1-0 plotly_4.9.3
[112] grid_4.0.5 readxl_1.3.1 data.table_1.14.0
[115] reprex_2.0.0 digest_0.6.27 xtable_1.8-4
[118] httpuv_1.5.5 RcppParallel_5.1.2 stringfish_0.15.1
[121] munsell_0.5.0 viridisLite_0.4.0
Thanks for bringing this up.
After removing the overflow check (which I now think is unnecessary), this issue seems to be resolved in my latest branch. It is still in the review process (#88), but feel free to check out jaisonj708:feat/presto_updates.
This works on large datasets, including the example you gave but please let me know if you have any further issues on your end.
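For readers wondering why the failure only appears somewhere between 40,000 and 73,000 cells, here is a minimal base-R sketch of how an integer overflow check can behave. It assumes (this is not confirmed in the thread) that the removed check multiplied the number of cells by itself using R's 32-bit integers; the variable names are illustrative only.

```r
# Assumption: the removed check squared the cell count with integer arithmetic.
# In R, integer results above .Machine$integer.max become NA (with a warning).
n_small <- 40000L  # cell count that worked in the report
n_large <- 73000L  # cell count that failed in the report

small_sq <- suppressWarnings(n_small * n_small)
large_sq <- suppressWarnings(n_large * n_large)

is.na(small_sq)  # the square still fits in a 32-bit integer
is.na(large_sq)  # the square overflows, so the result is NA

# Largest cell count whose square still fits in a 32-bit integer
max_cells <- floor(sqrt(.Machine$integer.max))
max_cells
```

Under that assumption, any dataset above max_cells would take the overflow path, and if that path never assigns p_val, the later data.frame(p_val, ...) call would fail exactly as reported.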
@jaisonj708 Great, thanks a lot, that was an easy solution. Works fine with your latest changes.
Concerning the overflow check: I think the amount of memory needed is much smaller than with the Seurat::FindMarkers function. Therefore, I think it's fine to remove the overflow part completely as you did (one could also think about an option such as limitsize = FALSE in ggplot2).
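As a rough illustration of the limitsize-style option mentioned above, the following sketch shows an opt-out size guard. The function check_size and its arguments max_cells and limitsize are hypothetical and not part of SeuratWrappers or presto; the default threshold is arbitrary.

```r
# Hypothetical helper (not a SeuratWrappers function): stop on very large
# inputs unless the caller explicitly opts out with limitsize = FALSE.
check_size <- function(n_cells, max_cells = 50000L, limitsize = TRUE) {
  if (limitsize && n_cells > max_cells) {
    stop(
      "Input has ", n_cells, " cells, more than max_cells = ", max_cells,
      "; set limitsize = FALSE to run anyway.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Small inputs pass; large inputs pass only when the guard is switched off.
check_size(40000L)
check_size(73000L, limitsize = FALSE)
```

A wrapper could call such a guard before the test itself, so that large inputs produce an explicit message instead of an unrelated downstream error.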
FindMarkers is one of the most time-consuming steps of the entire scRNA-seq data analysis, and presto is so much faster that it would be great to have it as part of the default Seurat::FindMarkers. I think there are many people who would like to have this performance improvement but just don't know about RunPresto because it's well hidden right now ;) .
Hi @mihem,
This PR is merged now, so the latest seurat-wrappers should also work. If/when presto goes to an official repository (e.g. CRAN or Bioconductor), we will likely add it as an option in FindMarkers as well.
Sorry to necro bump.
But thanks to @AustinHartman, presto is now on CRAN. So @andrewwbutler, maybe RunPresto, which produces the same results as FindMarkers just way faster, could now be added as an option to FindMarkers, as you proposed?
I think many users, myself included, would appreciate that.