textmodel_NB prediction with "bernoulli" is slow and limited
Closed this issue · 5 comments
I noticed that predict() for a textmodel_NB with distribution = "Bernoulli" does not perform as well as one with distribution = "multinomial":
- it is significantly slower (about 100 times slower)
> small_data
Document-feature matrix of: 500 documents, 21,404 features (99.2% sparse).
> microbenchmark::microbenchmark(multinomial = {predict(nb_hashtag_mention_multinomial, small_data)},
+ bernoulli = {predict(nb_hashtag_mention_bernoulli, small_data)},
+ times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
multinomial 26.88236 37.56514 53.8817 48.40566 64.68182 110.5487 10
bernoulli 1932.85757 2580.22634 5984.4403 3542.37562 4571.23802 17777.3264 10
- it throws an error when the dataset is large
> hashtag_mention_all_accounts
Document-feature matrix of: 3,662,575 documents, 21,404 features (100% sparse).
> predict_all_users_multinomial <- predict(nb_hashtag_mention_multinomial, hashtag_mention_all_accounts)
> predict_all_users_bernoulli <- predict(nb_hashtag_mention_bernoulli, hashtag_mention_all_accounts)
Error in C2dense(x) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
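A possible stopgap until this is fixed: since the error comes from converting the whole dfm to dense at once, predicting in row chunks keeps each conversion small. This is only a sketch with placeholder names (`model`, `dfm_all` stand in for the fitted textmodel_NB and the large dfm above), and it assumes the per-chunk predictions concatenate cleanly:

```r
# Hypothetical workaround: predict in chunks of rows so that any
# internal dense conversion stays small. `model` and `dfm_all` are
# placeholder names for a fitted textmodel_NB and a large dfm.
predict_chunked <- function(model, dfm_all, chunk_size = 10000) {
  n <- nrow(dfm_all)
  starts <- seq(1, n, by = chunk_size)
  preds <- lapply(starts, function(i) {
    rows <- i:min(i + chunk_size - 1, n)
    predict(model, dfm_all[rows, ])
  })
  unlist(preds)  # assumes predictions are simple vectors that combine cleanly
}
```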
Here is sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] brexithelper_0.1.0 magrittr_1.5 data.table_1.10.4 ggplot2_2.2.1.9000 quanteda_0.9.9.72
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 knitr_1.15.2 devtools_1.12.0.9000 pkgload_0.0.0.9000 munsell_0.4.3
[6] colorspace_1.3-2 lattice_0.20-34 R6_2.2.0 rlang_0.1.1 fastmatch_1.1-0
[11] plyr_1.8.4 httr_1.2.1 tools_3.3.3 pkgbuild_0.0.0.9000 grid_3.3.3
[16] gtable_0.2.0 git2r_0.16.0 withr_1.0.2 lazyeval_0.2.0 RcppParallel_4.3.20
[21] digest_0.6.12 tibble_1.3.3 Matrix_1.2-8 callr_1.0.0.9000 microbenchmark_1.4-2.1
[26] curl_2.3 memoise_1.0.0 stringi_1.1.5 scales_0.4.1
Are there any tricks to improve the performance?
Something is making it dense in the code then. I'll put it on the list to fix. Thanks!
I think the problems are in the predict() method. Can you supply me with the objects, or try it with fit and predict as separate steps?
I found this package by accident, but the code looks nice.
https://github.com/mskogholt/fastNaiveBayes
There's also naivebayes https://cran.r-project.org/web/packages/naivebayes/index.html
but I don't think either that or fastNaiveBayes works on sparse objects. I will test them, though, thanks!
I thought this would be good because of the sparse argument.
https://github.com/mskogholt/fastNaiveBayes/blob/012eafd7685aed208faf1ebcacfdb632db84beb6/R/fastNaiveBayes.R#L4
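For what it's worth, the Bernoulli scoring step itself never needs a dense conversion: the per-class log-posterior is log prior + x · log(p/(1-p)) + Σ log(1-p), so the only matrix product involves the sparse indicator matrix directly. A minimal sketch (illustrative names only; `x` is a sparse docs-by-features 0/1 matrix, `p` a features-by-classes matrix of estimated P(feature | class), `log_prior` the class log-priors):

```r
library(Matrix)

# Sketch of Bernoulli NB scoring that keeps the dfm sparse.
# score(doc, class) = log_prior + x %*% log(p/(1-p)) + colSums(log(1-p))
bernoulli_nb_scores <- function(x, p, log_prior) {
  ll <- x %*% log(p / (1 - p))         # sparse %*% dense: no dense dfm needed
  const <- colSums(log(1 - p)) + log_prior  # per-class additive constant
  sweep(as.matrix(ll), 2, const, "+")
}
```

This is just the standard Bernoulli likelihood rearranged; it is not how the quanteda internals are written, but it suggests the dense conversion is avoidable.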