dselivanov/LSHR

benchmarks


As far as I know, there is only one other package for LSH in R - textreuse. It is well documented and tested, but noticeably slower (about 5x in the benchmark below):

# devtools::install_github('dselivanov/text2vec')
# devtools::install_github('dselivanov/LSHR')
# devtools::install_github('ropensci/textreuse')
library(textreuse)
library(text2vec)
library(LSHR)
library(doParallel)
library(data.table)
library(microbenchmark)

N_WORKER <- 4
registerDoParallel(N_WORKER)
options(mc.cores = N_WORKER)

data("movie_review")
txt <- tolower(movie_review$review)
hashfun_number <- 240
bands_number <- 10

system.time({
  minhash <- minhash_generator(n = hashfun_number, seed = 3552)
  corpus <- TextReuseCorpus(text = txt, tokenizer = tokenize_words, lowercase = FALSE, 
                            minhash_func = minhash, keep_tokens = F,
                            progress = FALSE)
})
#   user  system elapsed 
#  7.440   0.343   3.760 
system.time({
  buckets <- lsh(corpus, bands_number, progress = FALSE)
  candidates <- lsh_candidates(buckets)
})
#   user  system elapsed 
#  6.321   0.080   6.407 

candidates
#         a        b score
#     (chr)    (chr) (dbl)
#1 doc-1054 doc-1417    NA
#2  doc-106 doc-4994    NA
#3 doc-1084 doc-3462    NA
#4 doc-1291 doc-1356    NA
#5 doc-1615 doc-3846    NA
#6 doc-2805 doc-4763    NA

system.time({
  jobs <- txt %>% split_into(N_WORKER) %>% lapply(itoken, tokenizer = word_tokenizer)
  dtm <- create_dtm(jobs, hash_vectorizer())
})
#   user  system elapsed 
#  2.085   1.124   0.638 
system.time({
    sign_mat <- get_signature_matrix(dtm, hashfun_number = hashfun_number, measure = 'jaccard', seed = 12L)
    candidate_indices <- get_similar_pairs(sign_mat, bands_number, F)
})
#   user  system elapsed 
#  1.875   0.858   1.176 

candidate_indices
#    id1  id2
#1: 1291 1356
#2: 2805 4763
#3: 1084 3462
#4: 1054 1417
#5: 1615 3846

@lmullen, I'd be happy to contribute to textreuse, but I think it will be difficult to integrate LSHR into textreuse. To get this kind of speed, we need to work with matrices and vectors instead of lists and other high-level data structures...
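To make that concrete, here is a rough sketch (not LSHR's actual internals) of the banding step done on a dense signature matrix; the sizes and random signatures are made up for illustration, and with random signatures the candidate set will normally be empty, whereas real minhash signatures of similar documents do collide within bands:

library(data.table)

set.seed(42)
n_docs <- 1000L
hashfun_number <- 240L
bands_number <- 10L
rows_per_band <- hashfun_number %/% bands_number

# signature matrix: one row per hash function, one column per document
sig <- matrix(sample.int(.Machine$integer.max, hashfun_number * n_docs, replace = TRUE),
              nrow = hashfun_number)

# collapse each band of every document into a single bucket key
band <- rep(seq_len(bands_number), each = rows_per_band)
keys <- vapply(seq_len(n_docs),
               function(j) tapply(sig[, j], band, paste, collapse = "_"),
               character(bands_number))

# group documents by (band, bucket key); buckets holding more than one document give candidate pairs
dt <- data.table(band = rep(seq_len(bands_number), n_docs),
                 key  = as.vector(keys),
                 doc  = rep(seq_len(n_docs), each = bands_number))
candidates <- dt[, if (.N > 1) as.data.table(t(combn(doc, 2L))), by = .(band, key)]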

@dselivanov Thanks for bringing up the topic. I've been thinking about how to do the next version of textreuse. As you've probably figured out, I spent a lot of time making it work well with the NLP package. But now that text2vec is mature with the 0.3 version, and much faster than the other natural language packages in R, I'd like to redo textreuse on the basis of text2vec. I've also been following the same discussion you had with @TommyJones about textmineR here. It seems to me that textreuse should follow the same pattern: use text2vec as a base for data ingest and representing data as a DTM, then build higher-level analytical functions on top of it.

Here is what I think might work.

  1. The main thing that is missing from text2vec is a way of keeping track of the actual texts and being able to interrogate them. Obviously this would only work for corpora which can be held in memory. For instance, in textreuse, I can do content(corpus[["doc_id"]]) and see the full text. This is not so helpful if you're interested in training a GloVe model. But in my use cases, I'm often moving back and forth between the macro-level view and the texts themselves. Furthermore, for one of the textreuse algorithms, the Smith-Waterman algorithm, a DTM is not sufficient. It's necessary to have the text itself to do local sequence alignment. So my question is this: is there a way that text2vec can have an option to keep the full text for the documents in a corpus? Or should I just keep that as a special use case in textreuse, and provide an iterator to create a DTM from a TextReuseCorpus object (see the sketch after this list)?
  2. It seems to me that there should be a separate package, called tokenizers. There are lots of different tokenizing functions in your package and mine, and they could be usefully collected into a single package, based on stringr. Are you interested in doing this? If so, I can create the package with the tokenizers from textreuse, and you can send me a PR with your tokenizers. Or vice versa. If you're not interested, I might do this anyway, and there is no harm in you retaining the tokenizers in text2vec. But it would be nice to combine efforts.
  3. I'm not surprised that your implementation of LSH is faster than mine. Once LSHR is finished, I can evaluate whether there is still any need for my implementation in textreuse. If there isn't, then I'll rewrite textreuse to use LSHR as well as text2vec.
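To illustrate the second half of point 1, here is a sketch of the TextReuseCorpus-to-DTM route, assuming the corpus built in the benchmark above can be treated as a list of documents whose texts are extracted with content(); the glue code is only illustrative:

# pull the raw texts out of the TextReuseCorpus; the names are the document ids
texts <- vapply(corpus, content, character(1))

it <- itoken(texts, tokenizer = word_tokenizer)
dtm <- create_dtm(it, hash_vectorizer())
rownames(dtm) <- names(texts)  # in case the ids do not propagate automatically

# a DTM row can then be traced back to its full text
content(corpus[[rownames(dtm)[1]]])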

To sum up: I think you can keep doing what you're doing with text2vec and LSHR. It would be nice to combine efforts in the immediate term (even before the release of text2vec v. 0.3?) on a tokenizers package. But other than that, I'll work on redoing textreuse once your two packages have stabilized.

BTW: I'm getting to revising the documentation for text2vec as quickly as I can.

@lmullen, to your first question: I'm of the opinion that users should manage their documents themselves (for corpora held in memory) as named character vectors, or as character columns of a data frame. The names can link the full text to the rows of a DTM or to metadata held in a data frame, list, or similar.
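A minimal sketch of the named-character-vector approach (the explicit rownames assignment is only needed if itoken() does not pick up the vector names as document ids by itself):

library(text2vec)
data("movie_review")

# keep the raw texts as a named character vector; the names are the document ids
docs <- setNames(tolower(movie_review$review), movie_review$id)

it <- itoken(docs, tokenizer = word_tokenizer)
dtm <- create_dtm(it, hash_vectorizer())
rownames(dtm) <- names(docs)  # link DTM rows back to the texts

# retrieve the full text behind any DTM row by its id
docs[rownames(dtm)[1]]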

As I said to @dselivanov , I started textmineR out of dissatisfaction with existing NLP frameworks in R. (And I am really happy that text2vec is moving in the same direction at a much faster pace.) They tend to be object-oriented (corpus objects etc.) and not a very "R" way of doing things overall. (Far and away, this is my biggest problem with tm.) For most applications I've seen, a data frame, a column of which is the full-text, does a fine job as a corpus object. And it's more familiar/transparent to the R community. (Related: I think a huge advantage of the Matrix package is that its sparse matrices have methods that make working with them almost identical to a standard R dense matrix, no learning curve for useRs.)

So, my suggestion for text2vec would be to push corpus management to the users, but have examples that include corpus management through core R objects. Something like:

my_dtm <- text2vec::get_dtm(my_dataframe$document_text, ...)

rownames(my_dtm) <- rownames(my_dataframe) # or my_dataframe$ID_column

I don't think it has to limit more experienced users if the lower-level functionality (iterators, etc.) is still available should someone want to do highly custom work.

Does that make sense? Is there a good counter-argument that I'm missing?

If that's not the direction you all were headed, I'm happy to write and maintain higher-level wrappers and submit them to text2vec or keep them in textmineR. An example of what I mean is the function textmineR::Vec2Dtm, which will shortly have text2vec 0.3 on the back end. ;)

@TommyJones Your point about letting users manage the corpus for themselves in a data frame makes sense to me. Certainly my current research project has been hampered by the object-based way that the most common NLP packages store metadata. Far easier to have it in a data frame.

@dselivanov I'm working on the text2vec vignettes right now, and had a question about maintaining document IDs in the rownames, as you mentioned in your blog post. I see how that is possible when you make an iterator over a named character vector. Can text2vec support doing this with a column in a data frame, without creating a separate named character vector? For example, using your movie_review data frame with the id column.

@lmullen,

Can text2vec support doing this with a column in a data frame, without creating a separate named character vector? For example, using your movie_review data frame with the id column.

I thought about that. We probably need a special itoken.data.frame iterator constructor. I'll have a closer look today.
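Roughly, usage could look like the following; the ids argument here is hypothetical and only shows how such a constructor might be called, not an API that exists yet:

# hypothetical call: iterate over a data frame column and pass the ids alongside it
it <- itoken(movie_review$review,
             tokenizer = word_tokenizer,
             ids = movie_review$id)   # `ids` is an assumed future argument
dtm <- create_dtm(it, hash_vectorizer())
# rownames(dtm) would then carry movie_review$id directly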

@lmullen I think a separate tokenizers package is a great idea. In textreuse I saw skip-gram and character-level tokenizers, which can be very useful. Also, I'm not sure whether we should rely on stringr. Maybe a better idea is to use stringi directly. Personally I like the stringr syntax, but it seems Hadley has very little time for package updates. For example, the stringr version with sentence tokenization is still not published on CRAN... For the same reason I also do not want to add readr to the text2vec dependencies.
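For illustration, a word tokenizer built directly on stringi can be very small; this is just a sketch, not a proposed API for the tokenizers package:

library(stringi)

# split on Unicode word boundaries, dropping whitespace and punctuation tokens
tokenize_words_stringi <- function(x, lowercase = TRUE) {
  if (lowercase) x <- stri_trans_tolower(x)
  stri_split_boundaries(x, type = "word", skip_word_none = TRUE)
}

tokenize_words_stringi("Locality-sensitive hashing finds near-duplicate documents.")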

Regarding the raw text, I agree that users should take care of it themselves - keep it in files, databases, data frames, or whatever sources they like. We should provide mechanisms for storing the keys for the raw data, so users can easily retrieve documents from their own sources.

Okay, I'm glad we agree on the tokenizers package. I'll get that started unless you really want to do it. And I'm pretty sure I agree that for something low-level like this, a dependency directly on stringi is better.

And what you say about raw text makes a lot of sense. Thanks.


Most recent version with cosine similarity:

system.time({
  pairs <- get_similar_pairs(dtm, bands_number = 4, rows_per_band = 32,
                             distance = 'cosine', verbose = TRUE)
})
#   user  system elapsed 
#  0.475   0.009   0.070 
pairs[order(-N)]
# id1  id2 N
# 1: 1054 1417 4
# 2: 1084 3462 4
# 3: 1291 1356 4
# 4: 1615 3846 4
# 5: 2432 4535 2
# ---            
# 166: 4181 4272 1
# 167: 4365 4419 1
# 168: 4447 4527 1
# 169: 4550 4843 1
# 170: 4742 4839 1