satijalab/seurat-wrappers

RunPresto Error "object p_val not found" on large datasets

mihem opened this issue · 4 comments

mihem commented

RunPresto is great because it's super fast. When I ran it on a fairly large dataset (~ 73 000 cells), I got this error:

SeuratWrappers::RunPresto(my_dataset, ident.1 = "0")

Error in data.frame(p_val, row.names = rownames(x = data.use)) : object 'p_val' not found

I played around a little, and when I downsized the dataset to around 40 000 cells by subsetting, the above command runs without problems. I could also reproduce that on a different large dataset (around 80 000 cells). FindMarkers runs without problems by the way. That's a pity because RunPresto is especially useful in these large datasets.

Any idea what the problem could be?
@jaisonj708 maybe?

Thank you!

Edit:
To improve reproducibility, here is a minimal working example using the datasets of SeuratData.

library(Seurat)
library(SeuratData)
library(SeuratWrappers)

SeuratData::InstallData("hcabm40k")
SeuratData::InstallData("ifnb")

de1 <- SeuratWrappers::RunPresto(hcabm40k, ident.1 = "MantonBM1", ident.2 = NULL)
de2 <- SeuratWrappers::RunPresto(ifnb, ident.1 = "IMMUNE_CTRL", ident.2 = NULL)

large_dataset <- merge(hcabm40k, ifnb)

de3 <- SeuratWrappers::RunPresto(large_dataset, ident.1 = "IMMUNE_CTRL", ident.2 = NULL)

de1 and de2 give the expected results. When trying to compute de3, I get: Error in data.frame(p_val, row.names = rownames(x = data.use)) : object 'p_val' not found

session info
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libopenblasp-r0.3.13.so
LAPACK: /usr/lib/liblapack.so.3.9.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.1      stringr_1.4.0      dplyr_1.0.5        purrr_0.3.4       
 [5] readr_1.4.0        tidyr_1.1.3        tibble_3.1.1       ggplot2_3.3.3.9000
 [9] tidyverse_1.3.0    SeuratObject_4.0.0 Seurat_4.0.1      

loaded via a namespace (and not attached):
  [1] Rtsne_0.15            colorspace_2.0-0      deldir_0.2-10        
  [4] ellipsis_0.3.1        ggridges_0.5.3        fs_1.5.0             
  [7] rstudioapi_0.13       spatstat.data_2.1-0   leiden_0.3.7         
 [10] listenv_0.8.0         remotes_2.3.0         ggrepel_0.9.1        
 [13] fansi_0.4.2           lubridate_1.7.10      xml2_1.3.2           
 [16] codetools_0.2-18      splines_4.0.5         polyclip_1.10-0      
 [19] jsonlite_1.7.2        broom_0.7.6           ica_1.0-2            
 [22] cluster_2.1.1         dbplyr_2.1.1          png_0.1-7            
 [25] uwot_0.1.10           shiny_1.6.0           sctransform_0.3.2    
 [28] spatstat.sparse_2.0-0 BiocManager_1.30.12   compiler_4.0.5       
 [31] httr_1.4.2            SeuratWrappers_0.3.0  backports_1.2.1      
 [34] assertthat_0.2.1      Matrix_1.3-2          fastmap_1.1.0        
 [37] lazyeval_0.2.2        limma_3.46.0          cli_2.4.0            
 [40] later_1.1.0.1         htmltools_0.5.1.1     tools_4.0.5          
 [43] rsvd_1.0.5            igraph_1.2.6          gtable_0.3.0         
 [46] glue_1.4.2            RANN_2.6.1            reshape2_1.4.4       
 [49] Rcpp_1.0.6            scattermore_0.7       cellranger_1.1.0     
 [52] presto_1.0.0          vctrs_0.3.7           nlme_3.1-152         
 [55] lmtest_0.9-38         ps_1.6.0              globals_0.14.0       
 [58] rvest_1.0.0           mime_0.10             miniUI_0.1.1.1       
 [61] lifecycle_1.0.0       irlba_2.3.3           goftest_1.2-2        
 [64] future_1.21.0         MASS_7.3-53.1         zoo_1.8-9            
 [67] scales_1.1.1          spatstat.core_2.1-2   hms_1.0.0            
 [70] promises_1.2.0.1      spatstat.utils_2.1-0  parallel_4.0.5       
 [73] RColorBrewer_1.1-2    qs_0.24.1             reticulate_1.18      
 [76] pbapply_1.4-3         gridExtra_2.3         rpart_4.1-15         
 [79] stringi_1.5.3         rlang_0.4.10          pkgconfig_2.0.3      
 [82] matrixStats_0.58.0    lattice_0.20-41       ROCR_1.0-11          
 [85] tensor_1.5            patchwork_1.1.1       htmlwidgets_1.5.3    
 [88] cowplot_1.1.1         tidyselect_1.1.0      parallelly_1.24.0    
 [91] RcppAnnoy_0.0.18      plyr_1.8.6            magrittr_2.0.1       
 [94] R6_2.5.0              generics_0.1.0        DBI_1.1.1            
 [97] pillar_1.6.0          haven_2.4.0           withr_2.4.2          
[100] mgcv_1.8-34           fitdistrplus_1.1-3    survival_3.2-10      
[103] abind_1.4-5           future.apply_1.7.0    modelr_0.1.8         
[106] crayon_1.4.1          KernSmooth_2.23-18    utf8_1.2.1           
[109] RApiSerialize_0.1.0   spatstat.geom_2.1-0   plotly_4.9.3         
[112] grid_4.0.5            readxl_1.3.1          data.table_1.14.0    
[115] reprex_2.0.0          digest_0.6.27         xtable_1.8-4         
[118] httpuv_1.5.5          RcppParallel_5.1.2    stringfish_0.15.1    
[121] munsell_0.5.0         viridisLite_0.4.0 

Thanks for bringing this up.

After removing the overflow check (which I now think is unnecessary), this issue seems to be resolved in my latest branch. It is still in the review process (#88), but feel free to check out jaisonj708:feat/presto_updates.

This works on large datasets, including the example you gave, but please let me know if you have any further issues on your end.
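
For readers wondering how a size check can misfire on large inputs, here is a minimal sketch in base R. It assumes the check multiplied integer row and column counts, which this thread does not confirm for RunPresto, and the dimensions are hypothetical rather than taken from the datasets above.

```r
# Hypothetical dimensions, chosen only so that the product exceeds the
# integer range; they are not the sizes of the datasets in this thread.
n_genes <- 30000L
n_cells <- 80000L

# Integer arithmetic in R cannot represent values above
# .Machine$integer.max; the product becomes NA (with a warning).
n_int <- suppressWarnings(n_genes * n_cells)
is.na(n_int)

# Converting to double before multiplying keeps the true element count.
n_dbl <- as.numeric(n_genes) * as.numeric(n_cells)
n_dbl > .Machine$integer.max
```

A check that treats the NA as "too large" and then skips the computation would leave downstream variables undefined, which matches the kind of "object not found" error reported here, though that is only one possible explanation.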

mihem commented

@jaisonj708 Great, thanks a lot, that was an easy solution. It works fine with your latest changes.
Concerning the overflow check: I think the amount of memory needed is much smaller than with the Seurat::FindMarkers function. Therefore, I think it's fine to remove the overflow part completely, as you did (one could also think about an option such as limitsize = FALSE in ggplot2).

FindMarkers is one of the most time-consuming steps of the entire scRNA-seq data analysis, and presto is so much faster that it would be great to have it as part of the default Seurat::FindMarkers. I think there are many people who would like to have this performance improvement but just don't know about RunPresto because it's well hidden right now ;) .

Hi @mihem,

This PR is merged now so the latest seurat-wrappers should also work. If/when presto goes to an official repository (e.g. CRAN or Bioconductor), we will likely add it as an option in FindMarkers as well.

mihem commented

Sorry to necro bump.
But thanks to @AustinHartman, presto is now on CRAN. So @andrewwbutler, maybe RunPresto, which produces the same results as FindMarkers just way faster, could now be added as an option to FindMarkers, as you proposed?
I think many users, myself included, would appreciate that.