zdebruine/RcppML

How to emulate functionality from NMF package for estimating the factorization rank

Closed this issue ยท 8 comments

I am interested in using this package to run an NMF analysis on some sparse data. The NMF R package from CRAN, provides the NMF rank survey functionality when using a range of values for k. I was wondering whether it's possible to obtain the sparseness/cophenetic/silhouette metrics per rank from RcppML::nmf()?

@wudustan try using the dev version from GitHub and check out the crossValidate function. This function applies Wold's Monte Carlo cross-validation to determine the optimal rank -- it is much better than any of cophenetic/silhouette metrics for determining rank. Let me know if you have questions on how to use that function!

I'm getting the following error:

Error in Rcpp_nmf_dense(data, mask_matrix, tol, maxit, getOption("RcppML.verbose"),  : 
  c++ exception (unknown reason)

Could you suggest a reason?

My data:

> dim(b169.dmso.frozen.nmf.mat)
[1] 3000 5907

> b169.dmso.frozen.nmf.mat[1:5, 1:5]
         B169.DMSO.FROZEN_1 B169.DMSO.FROZEN_3 B169.DMSO.FROZEN_5 B169.DMSO.FROZEN_7 B169.DMSO.FROZEN_8
VIM                4.346295           7.397843           8.200908           5.429212           8.478549
HIST1H4C           9.237470           6.009916           8.941542           5.770367           6.146523
UBE2C              7.773624           2.964877           7.697491          10.865464          10.649683
CENPF              6.568869           5.974733           7.942087           9.413728           9.651910
HIST1H1E           6.746004           6.719570           7.642481           1.937257           5.363575

> class(b169.dmso.frozen.nmf.mat)
[1] "matrix" "array" 

The function call:
b169.dmso.frozen.ranks <- crossValidate(b169.dmso.frozen.nmf.mat, k = 2:8, verbose = TRUE)

GCC:

more ~/.R/Makevars
VER=-12
CC=gcc$(VER)
CXX=g++$(VER)
CXX11=g++$(VER)
CXX14=g++$(VER)
CXX17=g++$(VER)
CFLAGS=-mtune=native -g -O2 -Wall -pedantic -Wconversion
CXXFLAGS=-mtune=native -g -O2 -Wall -pedantic -Wconversion
FLIBS=-L/usr/local/Cellar/gcc/12.2.0/lib/gcc/12

Session info:

R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.6.2

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:

[1] RcppML_0.5.6                pagoda2_1.0.10              igraph_1.3.5                Matrix_1.5-3               
 [5] reshape2_1.4.4              EnsDb.Hsapiens.v75_2.99.0   ensembldb_2.20.2            AnnotationFilter_1.20.0    
 [9] GenomicFeatures_1.48.4      AnnotationDbi_1.58.0        kneedle_1.0.0               readxl_1.4.1               
[13] UCell_2.0.1                 hues_0.2.0.9000             patchwork_1.1.2             schex_1.10.0               
[17] shiny_1.7.4                 cowplot_1.1.1               scater_1.24.0               scuttle_1.6.3              
[21] SingleCellExperiment_1.18.1 harmony_0.1.1               Rcpp_1.0.9                  SeuratWrappers_0.3.1       
[25] Seurat_4.3.0                SeuratObject_4.1.3          sp_1.5-1                    dplyr_1.0.10               
[29] ggplot2_3.4.0               stringr_1.5.0               ShortRead_1.54.0            GenomicAlignments_1.32.1   
[33] SummarizedExperiment_1.26.1 Biobase_2.56.0              MatrixGenerics_1.8.1        matrixStats_0.63.0         
[37] Rsamtools_2.12.0            GenomicRanges_1.48.0        Biostrings_2.64.1           GenomeInfoDb_1.32.4        
[41] XVector_0.36.0              IRanges_2.30.1              S4Vectors_0.34.0            BiocParallel_1.30.4        
[45] BiocGenerics_0.42.0        

I am not able to replicate using random dense matrices. I'm wondering if there is something specific to your matrix that is causing this?

set.seed(123)
A <- matrix(runif(300 * 500), 300, 500)
model <- RcppML::crossValidate(A, 2:8, verbose = TRUE)

I am not sure if you can provide a reproducible example somehow?

This is very weird. At the office it wasn't running. I got home and saw your message, gave it another try and now it's running fine. The only change is I opened a new R session but restarting R was the first thing i tried when I got the c++ error. Thank you for taking the time to try and replicate. I'm really not sure what the issue was.

The speed of this is very impressive. Could you give me some info on how to interpret the output of the crossValidate() I ran it for 2:20 ranks.

ffcdfdbb-c986-4b40-8d80-0ad0058d6f90

Sure, glad it works for you. The optimal rank appears to be somewhere around k = 15. I'd do + scale_y_continuous(limits = c(1, 1.1) and that might help you get a better idea. Those last outliers are screwing with the visualization a bit.

aede28df-c5c7-49f1-86d4-71008d8a53eb
Here it is constrained. Do you think it's worth going beyond 20 ranks?

No, you want k = 12.

Thanks for your help.