ropensci/textreuse

Error when using functions from tokenizers package

Closed this issue · 2 comments

I seem to be getting an error when passing one of the tokenizer functions from the tokenizers package to TextReuseCorpus(). Any idea why this might be happening?

library(textreuse)

test_corpus_dir <- paste0(system.file(package = "textreuse"), "/extdata/ats/")

test_corpus <- TextReuseCorpus(dir = test_corpus_dir, tokenizer = tokenizers::tokenize_character_shingles)
#> Error in hash_func(tokens): expecting a string vector

And the traceback:

6. stop(structure(list(message = "expecting a string vector", call = hash_func(tokens), cppstack = NULL), .Names = c("message", "call", "cppstack"), class = c("Rcpp::not_compatible", "C++Error", "error", "condition")))
5. hash_func(tokens)
4. TextReuseTextDocument(file = paths[i], tokenizer = tokenizer, ..., hash_func = hash_func, minhash_func = minhash_func, keep_tokens = keep_tokens, keep_text = keep_text, skip_short = skip_short, meta = list(tokenizer = tokenizer_name, hash_func = hash_func_name, ...
3. FUN(X[[i]], ...)
2. apply_func(seq_along(paths), function(i) { d <- TextReuseTextDocument(file = paths[i], tokenizer = tokenizer, ..., hash_func = hash_func, minhash_func = minhash_func, keep_tokens = keep_tokens, keep_text = keep_text, skip_short = skip_short, ...
1. TextReuseCorpus(dir = test_corpus_dir, tokenizer = tokenizers::tokenize_character_shingles)
Session info
devtools::session_info()
#> Session info --------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.3.1 (2016-06-21)
#>  system   x86_64, darwin15.5.0        
#>  ui       unknown                     
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       America/Los_Angeles         
#>  date     2017-04-07
#> Packages ------------------------------------------------------------------
#>  package      * version date       source        
#>  assertthat     0.1     2013-12-06 CRAN (R 3.3.1)
#>  backports      1.0.5   2017-01-18 CRAN (R 3.3.1)
#>  devtools       1.12.0  2016-12-05 CRAN (R 3.3.1)
#>  digest         0.6.12  2017-01-27 CRAN (R 3.3.1)
#>  evaluate       0.10    2016-10-11 CRAN (R 3.3.1)
#>  htmltools      0.3.5   2016-03-21 CRAN (R 3.3.1)
#>  knitr          1.15.1  2016-11-22 CRAN (R 3.3.1)
#>  magrittr       1.5     2014-11-22 CRAN (R 3.3.1)
#>  memoise        1.0.0   2016-01-29 CRAN (R 3.3.1)
#>  NLP            0.1-10  2017-02-21 CRAN (R 3.3.1)
#>  Rcpp           0.12.10 2017-03-19 CRAN (R 3.3.1)
#>  RcppProgress   0.3     2017-01-05 CRAN (R 3.3.1)
#>  rmarkdown      1.4     2017-03-24 CRAN (R 3.3.1)
#>  rprojroot      1.2     2017-01-16 CRAN (R 3.3.1)
#>  SnowballC      0.5.1   2014-08-09 CRAN (R 3.3.1)
#>  stringi        1.1.3   2017-03-21 CRAN (R 3.3.1)
#>  stringr        1.2.0   2017-02-18 CRAN (R 3.3.1)
#>  textreuse    * 0.1.4   2016-11-28 CRAN (R 3.3.1)
#>  tokenizers     0.1.4   2016-08-29 CRAN (R 3.3.1)
#>  withr          1.0.2   2016-06-20 CRAN (R 3.3.1)
#>  yaml           2.1.14  2016-11-12 CRAN (R 3.3.1)

You need to specify simplify = TRUE in the TextReuseCorpus() call so that it gets passed on to the tokenizer:

test_corpus <- TextReuseCorpus(dir = test_corpus_dir, tokenizer = tokenizers::tokenize_character_shingles, simplify = TRUE)

Let me know if that solves your problem.
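To see why that argument matters, here is a quick illustration of the return types (a minimal sketch; the outputs are what I'd expect from the tokenizers defaults):

# By default the tokenizers functions return a list, one element per document
class(tokenizers::tokenize_character_shingles("a document"))
#> [1] "list"

# With simplify = TRUE, a single input yields a plain character vector,
# which is the shape textreuse's hash function expects
class(tokenizers::tokenize_character_shingles("a document", simplify = TRUE))
#> [1] "character"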

The textreuse package shipped with its own tokenizers, which have a different interface: the tokenization functions in textreuse take a single document and return a character vector, while the functions in the tokenizers package can take multiple input documents and always return a list unless simplify = TRUE. I haven't yet updated textreuse to build on the tokenizers package, so you aren't getting any of its speed advantages. Updating textreuse to be compatible with tokenizers is up next after I release the next version of tokenizers. That will be an interim release until I can rewrite the whole package around a more sensible corpus object and reimplement some of the functions in more efficient matrix-based ways.
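In the meantime, if you'd rather not pass simplify = TRUE at every call site, you could also wrap the tokenizer yourself so it matches the single-document, character-vector interface textreuse expects. A minimal sketch (shingle_tokenizer is a hypothetical name, not part of either package):

# Hypothetical adapter: tokenize one document, then flatten the one-element
# list that tokenizers returns into a plain character vector
shingle_tokenizer <- function(string, n = 3) {
  unlist(tokenizers::tokenize_character_shingles(string, n = n))
}

test_corpus <- TextReuseCorpus(dir = test_corpus_dir, tokenizer = shingle_tokenizer)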

Yes, simplify = TRUE fixes the problem, thanks! I hadn't thought to check that the output class of the tokenizers functions might vary depending on the input.

At the moment we're not dealing with corpus sizes that present any speed challenges, but I do look forward to the coming updates for these packages.