neurosnap/sentences

optimization suggestion

lloyd opened this issue · 5 comments

lloyd commented

Staring at a profile at the moment where it appears that regex compilation happens at each tokenization. Seems like caching compiled regexes would make this (awesome) library twice as fast for use on large corpora?
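The kind of caching being suggested can be sketched with a minimal, self-contained example (this is illustrative stdlib Go, not the library's actual code): compiling the same pattern on every call versus compiling it once and reusing it.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
	"time"
)

const wordPattern = `[A-Za-z]+`

// Uncached: recompiles the regex on every tokenization call.
func tokenizeUncached(text string) []string {
	re := regexp.MustCompile(wordPattern) // compiled fresh each call
	return re.FindAllString(text, -1)
}

// Cached: the regex is compiled exactly once; sync.Once makes the
// lazy initialization safe to call from multiple goroutines.
var (
	wordRe     *regexp.Regexp
	wordReOnce sync.Once
)

func tokenizeCached(text string) []string {
	wordReOnce.Do(func() { wordRe = regexp.MustCompile(wordPattern) })
	return wordRe.FindAllString(text, -1)
}

func main() {
	text := "Caching compiled regexes avoids repeated work."

	start := time.Now()
	for i := 0; i < 10000; i++ {
		tokenizeUncached(text)
	}
	uncached := time.Since(start)

	start = time.Now()
	for i := 0; i < 10000; i++ {
		tokenizeCached(text)
	}
	cached := time.Since(start)

	fmt.Printf("uncached=%v cached=%v\n", uncached, cached)
}
```

For a package-level pattern that is always needed, a plain `var wordRe = regexp.MustCompile(...)` at init time is even simpler; `sync.Once` is useful when compilation should be deferred until first use.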

neurosnap commented

Greetings!

Thanks for the compliment about the library, glad to see it's being used with success.

Great suggestion, it turns out that simply caching these regular expressions amounted to a significant decrease in test run times.

Before:

10:25 $ make test
go test ./...
ok      github.com/neurosnap/sentences  0.161s
?       github.com/neurosnap/sentences/cmd/sentences    [no test files]
?       github.com/neurosnap/sentences/data [no test files]
ok      github.com/neurosnap/sentences/english  0.038s
?       github.com/neurosnap/sentences/utils    [no test files]

After:

10:34 $ make test
go test ./...
ok      github.com/neurosnap/sentences  0.046s
?       github.com/neurosnap/sentences/cmd/sentences    [no test files]
?       github.com/neurosnap/sentences/data [no test files]
ok      github.com/neurosnap/sentences/english  0.034s
?       github.com/neurosnap/sentences/utils    [no test files]

I'd love some larger-scale performance tests for this package, considering performance is a big reason to use this library over NLTK. If you have any suggestions on corpora to test, or if you want to contribute, I'd be happy to help in any way possible.
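One lightweight way to get such numbers is Go's built-in benchmarking support, which also works outside the `go test` harness via `testing.Benchmark`. The `tokenize` function below is a naive stand-in for the library's sentence tokenizer, and the synthetic repeated-sentence "corpus" is only a placeholder for a real one:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"testing"
)

// Stand-in tokenizer: a real benchmark would call this library's
// tokenizer instead of this naive regex split.
var sentenceRe = regexp.MustCompile(`[^.!?]+[.!?]`)

func tokenize(text string) []string {
	return sentenceRe.FindAllString(text, -1)
}

func main() {
	// Synthetic corpus: repeated sentences standing in for real text.
	corpus := strings.Repeat("This is a sentence. Is it though? Yes! ", 5000)

	// testing.Benchmark runs the function enough times for a
	// stable per-operation estimate.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			tokenize(corpus)
		}
	})
	fmt.Printf("%d sentences per pass, %d ns/op\n",
		len(tokenize(corpus)), res.NsPerOp())
}
```

Wrapping the same loop in a `func BenchmarkTokenize(b *testing.B)` inside a `_test.go` file would let `go test -bench=.` track regressions over time.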

Out of curiosity, are you using this library for English sentence tokenization? If so, I'd like to point out that I have extended the base tokenizer to fix some of the common errors I have noticed from the Punkt sentence tokenizer:

https://github.com/neurosnap/sentences#english

lloyd commented

Greetings back at you!

Interesting. The "after" looks to me like the runtime has dropped about 3.5x: sentences finishes in 0.046s instead of 0.161s. Am I reading this wrong?

I'll think about freely available test corpora...

lloyd commented

Oh, maybe I misunderstood you, given the sentence "simply caching these regular expressions amounted to a significant increase...". But now I see you actually landed a change to cache compiled regexes!

ignore my pull request!

neurosnap commented

Whoops, that was my bad; I found a significant decrease in test run times.

lloyd commented

Awesome, thanks for listening and landing a change! That shaved an hour off a 7-hour batch process we run that uses your code. 🚀