quanteda/quanteda.textmodels

textmodel_NB prediction with "bernoulli" is slow and limited

Closed this issue · 5 comments

I noticed that predict() for a textmodel_NB with distribution = "Bernoulli" does not perform as well as with distribution = "multinomial".

  1. it is significantly slower (about 100 times slower)
```r
> small_data
Document-feature matrix of: 500 documents, 21,404 features (99.2% sparse).
> microbenchmark::microbenchmark(multinomial = {predict(nb_hashtag_mention_multinomial, small_data)},
+                                bernoulli = {predict(nb_hashtag_mention_bernoulli, small_data)}, 
+                                times = 10)
Unit: milliseconds
        expr        min         lq      mean     median         uq        max neval
 multinomial   26.88236   37.56514   53.8817   48.40566   64.68182   110.5487    10
   bernoulli 1932.85757 2580.22634 5984.4403 3542.37562 4571.23802 17777.3264    10
```
  2. it returns an error when the dataset is large
```r
> hashtag_mention_all_accounts
Document-feature matrix of: 3,662,575 documents, 21,404 features (100% sparse).
> predict_all_users_multinomial <- predict(nb_hashtag_mention_multinomial, hashtag_mention_all_accounts)
> predict_all_users_bernoulli <- predict(nb_hashtag_mention_bernoulli, hashtag_mention_all_accounts)
Error in C2dense(x) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
```

Here is my sessionInfo():

```
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] brexithelper_0.1.0 magrittr_1.5       data.table_1.10.4  ggplot2_2.2.1.9000 quanteda_0.9.9.72 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11           knitr_1.15.2           devtools_1.12.0.9000   pkgload_0.0.0.9000     munsell_0.4.3         
 [6] colorspace_1.3-2       lattice_0.20-34        R6_2.2.0               rlang_0.1.1            fastmatch_1.1-0       
[11] plyr_1.8.4             httr_1.2.1             tools_3.3.3            pkgbuild_0.0.0.9000    grid_3.3.3            
[16] gtable_0.2.0           git2r_0.16.0           withr_1.0.2            lazyeval_0.2.0         RcppParallel_4.3.20   
[21] digest_0.6.12          tibble_1.3.3           Matrix_1.2-8           callr_1.0.0.9000       microbenchmark_1.4-2.1
[26] curl_2.3               memoise_1.0.0          stringi_1.1.5          scales_0.4.1          
```

Are there any tricks to improve the performance?

Something is making it dense in the code then. I'll put it on the list to fix. Thanks!
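For what it's worth, the Bernoulli log-likelihood can be rearranged so the dfm never has to be densified. Since log P(d|c) = Σⱼ xⱼ log pⱼc + (1 − xⱼ) log(1 − pⱼc), the (1 − xⱼ) term can be folded into a per-class constant, leaving only a sparse-times-dense matrix product. Here is a minimal sketch using the Matrix package (`bernoulli_nb_scores` is a hypothetical helper, not quanteda's actual predict() internals):

```r
library(Matrix)

# Hypothetical sketch of sparse Bernoulli NB scoring (not quanteda's code).
# x:        sparse document-feature matrix (documents x features)
# p:        dense matrix of P(feature present | class) (features x classes)
# logprior: log prior probability for each class
bernoulli_nb_scores <- function(x, p, logprior) {
  x <- (x > 0) * 1                 # binarise; result stays a sparse dgCMatrix
  w <- log(p) - log1p(-p)          # per-feature weight: log p - log(1 - p)
  const <- colSums(log1p(-p))      # per-class constant: sum_j log(1 - p_jc)
  scores <- as.matrix(x %*% w)     # sparse %*% dense -- x is never densified
  sweep(scores, 2, const + logprior, "+")
}
```

The key point is that the only operation touching the full dfm is `x %*% w`, which Matrix dispatches to a sparse method, so memory scales with the number of nonzero entries rather than documents × features.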

I think the problems are in the predict() method. Can you supply me the objects, or try it with the fit and predict as separate steps?

I found this package by accident, but the code looks nice.
https://github.com/mskogholt/fastNaiveBayes

There's also naivebayes https://cran.r-project.org/web/packages/naivebayes/index.html

but neither that nor fastNaiveBayes works on sparse objects, I think. I will test though, thanks!