Memory Error with Clustering with Leiden algorithm matrix - When to use matrix vs igraph method?
WilliamMWei opened this issue · 8 comments
Hi,
Thanks for the tool.
I attempted to cluster 45,000 cells using the Leiden algorithm with the default argument `method = "matrix"`. However, I encountered a memory error. When I changed to `method = "igraph"`, it ran fine.
The help says to use the `igraph` method to avoid casting a large dataset to a dense matrix, so it seems to exist simply to handle large datasets. Would you mind letting me know whether there are any other key differences between the `igraph` and `matrix` methods in terms of the clustering results? And when should I choose one over the other?
Related post: scverse/scanpy#1053
I have also posted here: satijalab/seurat#7979
Thank you so much for your support!
pbmc_cd4_cxcr5posneg.data_filtergene_filtercell_list_IndividualDatasetMERGED <- Seurat::FindClusters(pbmc_cd4_cxcr5posneg.data_filtergene_filtercell_list_IndividualDatasetMERGED, algorithm = 4, resolution = 1.2)
Error in py_call_impl(callable, call_args$unnamed, call_args$named) :
MemoryError
Run `reticulate::py_last_error()` for details.
In addition: There were 12 warnings (use warnings() to see them)
> reticulate::py_last_error()
── Python Exception Message ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
MemoryError
── R Traceback ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
▆
1. ├─Seurat::FindClusters(...)
2. └─Seurat:::FindClusters.Seurat(...)
3. ├─Seurat::FindClusters(...)
4. └─Seurat:::FindClusters.default(...)
5. └─Seurat:::RunLeiden(...)
6. ├─leiden::leiden(...)
7. └─leiden:::leiden.matrix(...)
8. ├─leiden:::make_py_graph(object, weights = weights)
9. └─leiden:::make_py_graph.matrix(object, weights = weights)
10. ├─leiden:::make_py_object(object, weights = weights)
11. └─leiden:::make_py_object.matrix(object, weights = weights)
12. └─adj_mat_py$tolist()
13. └─reticulate:::py_call_impl(callable, call_args$unnamed, call_args$named)
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.utf8 LC_CTYPE=English_United Kingdom.utf8 LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C LC_TIME=English_United Kingdom.utf8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] clustree_0.5.0 ggraph_2.1.0 ggplot2_3.4.4 reticulate_1.34.0 knitr_1.44 SeuratObject_4.1.4 Seurat_4.4.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.15.0 jsonlite_1.8.7 magrittr_2.0.3 spatstat.utils_3.0-3 farver_2.1.1
[7] rmarkdown_2.25 fs_1.6.3 vctrs_0.6.4 ROCR_1.0-11 memoise_2.0.1 spatstat.explore_3.2-5
[13] rstatix_0.7.2 htmltools_0.5.6.1 usethis_2.2.2 broom_1.0.5 sctransform_0.4.1 parallelly_1.36.0
[19] KernSmooth_2.23-21 htmlwidgets_1.6.2 ica_1.0-3 plyr_1.8.9 plotly_4.10.3 zoo_1.8-12
[25] cachem_1.0.8 igraph_1.5.1 mime_0.12 lifecycle_1.0.3 pkgconfig_2.0.3 Matrix_1.6-1.1
[31] R6_2.5.1 fastmap_1.1.1 fitdistrplus_1.1-11 future_1.33.0 shiny_1.7.5.1 digest_0.6.33
[37] colorspace_2.1-0 patchwork_1.1.3 ps_1.7.5 rprojroot_2.0.3 tensor_1.5 irlba_2.3.5.1
[43] pkgload_1.3.3 ggpubr_0.6.0 labeling_0.4.3 progressr_0.14.0 fansi_1.0.5 spatstat.sparse_3.0-2
[49] httr_1.4.7 polyclip_1.10-6 abind_1.4-5 compiler_4.3.1 here_1.0.1 remotes_2.4.2.1
[55] withr_2.5.1 backports_1.4.1 viridis_0.6.4 carData_3.0-5 pkgbuild_1.4.2 ggforce_0.4.1
[61] ggsignif_0.6.4 MASS_7.3-60 rappdirs_0.3.3 sessioninfo_1.2.2 tools_4.3.1 lmtest_0.9-40
[67] httpuv_1.6.12 future.apply_1.11.0 goftest_1.2-3 glue_1.6.2 callr_3.7.3 nlme_3.1-162
[73] promises_1.2.1 grid_4.3.1 checkmate_2.2.0 Rtsne_0.16 cluster_2.1.4 reshape2_1.4.4
[79] generics_0.1.3 gtable_0.3.4 spatstat.data_3.0-3 tidyr_1.3.0 data.table_1.14.8 tidygraph_1.2.3
[85] sp_2.1-1 car_3.1-2 utf8_1.2.4 spatstat.geom_3.2-7 RcppAnnoy_0.0.21 ggrepel_0.9.4
[91] RANN_2.6.1 pillar_1.9.0 stringr_1.5.0 later_1.3.1 splines_4.3.1 tweenr_2.0.2
[97] dplyr_1.1.3 lattice_0.21-8 survival_3.5-5 deldir_1.0-9 tidyselect_1.2.0 miniUI_0.1.1.1
[103] pbapply_1.7-2 gridExtra_2.3 scattermore_1.2 xfun_0.40 graphlayouts_1.0.1 devtools_2.4.5
[109] matrixStats_1.0.0 stringi_1.7.12 lazyeval_0.2.2 yaml_2.3.7 evaluate_0.22 codetools_0.2-19
[115] tibble_3.2.1 BiocManager_1.30.22 cli_3.6.1 uwot_0.1.16 xtable_1.8-4 munsell_0.5.0
[121] processx_3.8.2 Rcpp_1.0.11 globals_0.16.2 spatstat.random_3.2-1 png_0.1-8 parallel_4.3.1
[127] ellipsis_0.3.2 prettyunits_1.2.0 profvis_0.3.8 urlchecker_1.0.1 listenv_0.9.0 viridisLite_0.4.2
[133] scales_1.2.1 ggridges_0.5.4 leiden_0.4.3 purrr_1.0.2 crayon_1.5.2 rlang_1.1.1
[139] cowplot_1.1.1
Using a matrix is not a feature of this library. It is entirely specific to the leiden R package, which will convert that matrix to a graph before doing any community detection.
Given what the `leiden` package does, the claim in Seurat's documentation that the `"matrix"` method is faster for small data seems rather strange ... maybe it has to do with inefficient transfer of data between R and Python.
Thanks @szhorvat. Just so I understand correctly: do you mean that specifying `method = "matrix"` or `method = "igraph"` does not really affect the resulting clusters, and only affects how efficiently the data is processed before community detection? For example, with a large dataset, specifying `method = "igraph"` skips the conversion of the data to a dense matrix, thereby speeding up the whole clustering (community detection).
Presumably yes. But you need to discuss this with the packages that implemented these methods. This choice of methods does not come from the leidenalg Python package.
Thanks Szabolcs. Hopefully someone from Seurat will give some input.
Probably @TomKellyGenetics could bring some clues.
My opinion: maybe it's time to use igraph directly.
https://igraph.org/r/doc/cluster_leiden.html
@SamGG the leiden package already does this by default for igraph objects, although limited parameters are supported compared to calling Python. This has been supported for over a year with the 0.4 version.
@szhorvat Regarding "maybe it has to do with inefficient transfer of data between R and Python": that's correct, it does. reticulate supports dense matrices but not sparse matrices or igraph objects, so igraph objects are passed as an edge list and recreated in Python. This only applies to older versions of the R package, for the reasons discussed above, so the comment in the Seurat documentation is likely no longer relevant for users running igraph 1.2.7 and leiden 0.4.0 or later.
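To put rough numbers on why the dense path can run out of memory at 45,000 cells while the edge-list path does not, here is a back-of-the-envelope sketch in Python. It is only an estimate: the average of 50 nonzero neighbours per cell and the 16-byte edge layout are assumptions made for illustration, not values taken from Seurat or the leiden package.

```python
# Back-of-the-envelope memory estimate: dense adjacency matrix vs. edge list.
# Figures are illustrative; AVG_NEIGHBORS is an ASSUMED value, not taken
# from the thread or from any package default.

N_CELLS = 45_000          # number of cells reported in this issue
BYTES_PER_FLOAT64 = 8     # one double-precision matrix entry
AVG_NEIGHBORS = 50        # ASSUMPTION: average nonzero edges per cell


def dense_matrix_bytes(n: int) -> int:
    """Bytes needed to hold an n x n float64 matrix."""
    return n * n * BYTES_PER_FLOAT64


def edge_list_bytes(n: int, avg_neighbors: int) -> int:
    """Bytes for an edge list of (source, target, weight) triples.

    ASSUMPTION: two 4-byte integer indices plus one 8-byte weight per edge.
    """
    bytes_per_edge = 4 + 4 + BYTES_PER_FLOAT64
    return n * avg_neighbors * bytes_per_edge


dense = dense_matrix_bytes(N_CELLS)
sparse = edge_list_bytes(N_CELLS, AVG_NEIGHBORS)

print(f"dense matrix: {dense / 1e9:.1f} GB")
print(f"edge list:    {sparse / 1e6:.1f} MB")
```

The dense figure covers the raw array only. The traceback above fails inside `adj_mat_py$tolist()`, which turns the array into nested Python lists of individual float objects, so the real peak is likely several times higher.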
Thanks for this information and your feedback.
Thanks all for commenting in my absence! I believe all questions are addressed, so I'm closing this.