quanteda/quanteda.textmodels

textmodel_NB prediction with "bernoulli" is slow and limited

Closed this issue · 5 comments

I noticed that predict() for a textmodel_NB with distribution = "Bernoulli" does not perform as well as with distribution = "multinomial".

  1. it is significantly slower (about 100 times slower)
```r
> small_data
Document-feature matrix of: 500 documents, 21,404 features (99.2% sparse).
> microbenchmark::microbenchmark(multinomial = {predict(nb_hashtag_mention_multinomial, small_data)},
+                                bernoulli = {predict(nb_hashtag_mention_bernoulli, small_data)}, 
+                                times = 10)
Unit: milliseconds
        expr        min         lq      mean     median         uq        max neval
 multinomial   26.88236   37.56514   53.8817   48.40566   64.68182   110.5487    10
   bernoulli 1932.85757 2580.22634 5984.4403 3542.37562 4571.23802 17777.3264    10
```
  2. it returns an error when the dataset is large
```r
> hashtag_mention_all_accounts
Document-feature matrix of: 3,662,575 documents, 21,404 features (100% sparse).
> predict_all_users_multinomial <- predict(nb_hashtag_mention_multinomial, hashtag_mention_all_accounts)
> predict_all_users_bernoulli <- predict(nb_hashtag_mention_bernoulli, hashtag_mention_all_accounts)
Error in C2dense(x) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
```

Here is my sessionInfo():

```
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] brexithelper_0.1.0 magrittr_1.5       data.table_1.10.4  ggplot2_2.2.1.9000 quanteda_0.9.9.72 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11           knitr_1.15.2           devtools_1.12.0.9000   pkgload_0.0.0.9000     munsell_0.4.3         
 [6] colorspace_1.3-2       lattice_0.20-34        R6_2.2.0               rlang_0.1.1            fastmatch_1.1-0       
[11] plyr_1.8.4             httr_1.2.1             tools_3.3.3            pkgbuild_0.0.0.9000    grid_3.3.3            
[16] gtable_0.2.0           git2r_0.16.0           withr_1.0.2            lazyeval_0.2.0         RcppParallel_4.3.20   
[21] digest_0.6.12          tibble_1.3.3           Matrix_1.2-8           callr_1.0.0.9000       microbenchmark_1.4-2.1
[26] curl_2.3               memoise_1.0.0          stringi_1.1.5          scales_0.4.1          
```

Are there any tricks to improve the performance?

Something is making it dense in the code then. I'll put it on the list to fix. Thanks!
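For what it's worth, the Bernoulli log-likelihood can be rearranged so the dfm never has to be densified. Since log P(d|c) = Σⱼ xⱼ log pⱼc + (1 − xⱼ) log(1 − pⱼc), the (1 − xⱼ) term can be folded into a per-class constant, leaving only a sparse-times-dense matrix product. Here is a minimal sketch using the Matrix package (`bernoulli_nb_scores` is a hypothetical helper, not quanteda's actual predict() internals):

```r
library(Matrix)

# Hypothetical sketch of sparse Bernoulli NB scoring (not quanteda's code).
# x:        sparse document-feature matrix (documents x features)
# p:        dense matrix of P(feature present | class) (features x classes)
# logprior: log prior probability for each class
bernoulli_nb_scores <- function(x, p, logprior) {
  x <- (x > 0) * 1                 # binarise; result stays a sparse dgCMatrix
  w <- log(p) - log1p(-p)          # per-feature weight: log p - log(1 - p)
  const <- colSums(log1p(-p))      # per-class constant: sum_j log(1 - p_jc)
  scores <- as.matrix(x %*% w)     # sparse %*% dense -- x is never densified
  sweep(scores, 2, const + logprior, "+")
}
```

The key point is that the only operation touching the full dfm is `x %*% w`, which Matrix dispatches to a sparse method, so memory scales with the number of nonzero entries rather than documents × features.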

I think the problems are in the predict() method. Can you supply me the objects, or try it with the fit and predict as separate steps?

I found this package by accident, but the code looks nice.
https://github.com/mskogholt/fastNaiveBayes

There's also naivebayes https://cran.r-project.org/web/packages/naivebayes/index.html

but neither that nor fastNaiveBayes works on sparse objects, I think. I will test though, thanks!