neurosnap/sentences

optimization suggestion

lloyd opened this issue · 5 comments

lloyd commented

Staring at a profile at the moment where it appears that regex compilation happens at each tokenization. Seems like caching compiled regexes would make this (awesome) library twice as fast for use on large corpora?
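The kind of caching being suggested can be sketched with a minimal, self-contained example (this is illustrative stdlib Go, not the library's actual code): compiling the same pattern on every call versus compiling it once and reusing it.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
	"time"
)

const wordPattern = `[A-Za-z]+`

// Uncached: recompiles the regex on every tokenization call.
func tokenizeUncached(text string) []string {
	re := regexp.MustCompile(wordPattern) // compiled fresh each call
	return re.FindAllString(text, -1)
}

// Cached: the regex is compiled exactly once; sync.Once makes the
// lazy initialization safe to call from multiple goroutines.
var (
	wordRe     *regexp.Regexp
	wordReOnce sync.Once
)

func tokenizeCached(text string) []string {
	wordReOnce.Do(func() { wordRe = regexp.MustCompile(wordPattern) })
	return wordRe.FindAllString(text, -1)
}

func main() {
	text := "Caching compiled regexes avoids repeated work."

	start := time.Now()
	for i := 0; i < 10000; i++ {
		tokenizeUncached(text)
	}
	uncached := time.Since(start)

	start = time.Now()
	for i := 0; i < 10000; i++ {
		tokenizeCached(text)
	}
	cached := time.Since(start)

	fmt.Printf("uncached=%v cached=%v\n", uncached, cached)
}
```

For a package-level pattern that is always needed, a plain `var wordRe = regexp.MustCompile(...)` at init time is even simpler; `sync.Once` is useful when compilation should be deferred until first use.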

neurosnap commented

Greetings!

Thanks for the compliment about the library, glad to see it's being used with success.

Great suggestion, it turns out that simply caching these regular expressions amounted to a significant decrease in test run times.

Before:

10:25 $ make test
go test ./...
ok      github.com/neurosnap/sentences  0.161s
?       github.com/neurosnap/sentences/cmd/sentences    [no test files]
?       github.com/neurosnap/sentences/data [no test files]
ok      github.com/neurosnap/sentences/english  0.038s
?       github.com/neurosnap/sentences/utils    [no test files]

After:

10:34 $ make test
go test ./...
ok      github.com/neurosnap/sentences  0.046s
?       github.com/neurosnap/sentences/cmd/sentences    [no test files]
?       github.com/neurosnap/sentences/data [no test files]
ok      github.com/neurosnap/sentences/english  0.034s
?       github.com/neurosnap/sentences/utils    [no test files]

I'd love some larger-scale performance tests for this package, considering performance is a big reason to use this library over NLTK. If you have any suggestions on corpora to test, or if you want to contribute, I'd be happy to help in any way possible.
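One lightweight way to get such numbers is Go's built-in benchmarking support, which also works outside the `go test` harness via `testing.Benchmark`. The `tokenize` function below is a naive stand-in for the library's sentence tokenizer, and the synthetic repeated-sentence "corpus" is only a placeholder for a real one:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"testing"
)

// Stand-in tokenizer: a real benchmark would call this library's
// tokenizer instead of this naive regex split.
var sentenceRe = regexp.MustCompile(`[^.!?]+[.!?]`)

func tokenize(text string) []string {
	return sentenceRe.FindAllString(text, -1)
}

func main() {
	// Synthetic corpus: repeated sentences standing in for real text.
	corpus := strings.Repeat("This is a sentence. Is it though? Yes! ", 5000)

	// testing.Benchmark runs the function enough times for a
	// stable per-operation estimate.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			tokenize(corpus)
		}
	})
	fmt.Printf("%d sentences per pass, %d ns/op\n",
		len(tokenize(corpus)), res.NsPerOp())
}
```

Wrapping the same loop in a `func BenchmarkTokenize(b *testing.B)` inside a `_test.go` file would let `go test -bench=.` track regressions over time.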

Out of curiosity, are you using this library for English sentence tokenization? If so, I'd like to point out that I have extended the base tokenizer to fix some of the common errors I have noticed from the Punkt sentence tokenizer:

https://github.com/neurosnap/sentences#english

lloyd commented

Greetings back at you!

Interesting. The "after" looks to me like the runtime has dropped about 3.5x: sentences finishes in 0.046s instead of 0.161s. Am I reading this wrong?

I'll think about freely available test corpora...

lloyd commented

Oh, maybe I misunderstood you, given the sentence "simply caching these regular expressions amounted to a significant increase...". But now I see you actually landed a change to cache compiled regexes!

ignore my pull request!

neurosnap commented

Whoops, that was my bad; I found a significant decrease in test run times.

lloyd commented

Awesome, thanks for listening and landing a change! That shaved an hour off a 7-hour batch process we run that uses your code. 🚀