How to emulate functionality from NMF package for estimating the factorization rank
Closed this issue ยท 8 comments
I am interested in using this package to run an NMF analysis on some sparse data. The NMF R package from CRAN, provides the NMF rank survey functionality when using a range of values for k. I was wondering whether it's possible to obtain the sparseness/cophenetic/silhouette metrics per rank from RcppML::nmf()?
@wudustan try using the dev version from GitHub and check out the crossValidate
function. This function applies Wold's Monte Carlo cross-validation to determine the optimal rank -- it is much better than any of cophenetic/silhouette metrics for determining rank. Let me know if you have questions on how to use that function!
I'm getting the following error:
Error in Rcpp_nmf_dense(data, mask_matrix, tol, maxit, getOption("RcppML.verbose"), :
c++ exception (unknown reason)
Could you suggest a reason?
My data:
> dim(b169.dmso.frozen.nmf.mat)
[1] 3000 5907
> b169.dmso.frozen.nmf.mat[1:5, 1:5]
B169.DMSO.FROZEN_1 B169.DMSO.FROZEN_3 B169.DMSO.FROZEN_5 B169.DMSO.FROZEN_7 B169.DMSO.FROZEN_8
VIM 4.346295 7.397843 8.200908 5.429212 8.478549
HIST1H4C 9.237470 6.009916 8.941542 5.770367 6.146523
UBE2C 7.773624 2.964877 7.697491 10.865464 10.649683
CENPF 6.568869 5.974733 7.942087 9.413728 9.651910
HIST1H1E 6.746004 6.719570 7.642481 1.937257 5.363575
> class(b169.dmso.frozen.nmf.mat)
[1] "matrix" "array"
The function call:
b169.dmso.frozen.ranks <- crossValidate(b169.dmso.frozen.nmf.mat, k = 2:8, verbose = TRUE)
GCC:
more ~/.R/Makevars
VER=-12
CC=gcc$(VER)
CXX=g++$(VER)
CXX11=g++$(VER)
CXX14=g++$(VER)
CXX17=g++$(VER)
CFLAGS=-mtune=native -g -O2 -Wall -pedantic -Wconversion
CXXFLAGS=-mtune=native -g -O2 -Wall -pedantic -Wconversion
FLIBS=-L/usr/local/Cellar/gcc/12.2.0/lib/gcc/12
Session info:
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.6.2
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] RcppML_0.5.6 pagoda2_1.0.10 igraph_1.3.5 Matrix_1.5-3
[5] reshape2_1.4.4 EnsDb.Hsapiens.v75_2.99.0 ensembldb_2.20.2 AnnotationFilter_1.20.0
[9] GenomicFeatures_1.48.4 AnnotationDbi_1.58.0 kneedle_1.0.0 readxl_1.4.1
[13] UCell_2.0.1 hues_0.2.0.9000 patchwork_1.1.2 schex_1.10.0
[17] shiny_1.7.4 cowplot_1.1.1 scater_1.24.0 scuttle_1.6.3
[21] SingleCellExperiment_1.18.1 harmony_0.1.1 Rcpp_1.0.9 SeuratWrappers_0.3.1
[25] Seurat_4.3.0 SeuratObject_4.1.3 sp_1.5-1 dplyr_1.0.10
[29] ggplot2_3.4.0 stringr_1.5.0 ShortRead_1.54.0 GenomicAlignments_1.32.1
[33] SummarizedExperiment_1.26.1 Biobase_2.56.0 MatrixGenerics_1.8.1 matrixStats_0.63.0
[37] Rsamtools_2.12.0 GenomicRanges_1.48.0 Biostrings_2.64.1 GenomeInfoDb_1.32.4
[41] XVector_0.36.0 IRanges_2.30.1 S4Vectors_0.34.0 BiocParallel_1.30.4
[45] BiocGenerics_0.42.0
I am not able to replicate using random dense matrices. I'm wondering if there is something specific to your matrix that is causing this?
set.seed(123)
A <- matrix(runif(300 * 500), 300, 500)
model <- RcppML::crossValidate(A, 2:8, verbose = TRUE)
I am not sure if you can provide a reproducible example somehow?
This is very weird. At the office it wasn't running. I got home and saw your message, gave it another try and now it's running fine. The only change is I opened a new R session but restarting R was the first thing i tried when I got the c++ error. Thank you for taking the time to try and replicate. I'm really not sure what the issue was.
The speed of this is very impressive. Could you give me some info on how to interpret the output of the crossValidate() I ran it for 2:20 ranks.
Sure, glad it works for you. The optimal rank appears to be somewhere around k = 15
. I'd do + scale_y_continuous(limits = c(1, 1.1)
and that might help you get a better idea. Those last outliers are screwing with the visualization a bit.
No, you want k = 12
.
Thanks for your help.