swisscom/ai-research-keyphrase-extraction

Extraction is very slow?

Closed this issue · 5 comments

I am running extraction on short sentences, roughly 20 words.

I have followed your instructions to load the embedding model and the part of speech tagger once.

However, it takes about 3 seconds per extraction on a dedicated machine.

Is this expected? How can I make extraction faster?

On which language?

English.

The bottleneck is almost certainly the POS tagging.
You can check that it is actually the bottleneck by timing the tagger on its own: pos_tagger.pos_tag_raw_text(your_text).
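One quick way to measure that is a small timing helper; a sketch (time_call is a hypothetical helper, not part of the project):

```python
import time

def time_call(fn, *args, repeats=5):
    """Return the average wall-clock seconds per call of fn(*args)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

# Hypothetical usage, assuming pos_tagger is loaded as in the README:
# avg = time_call(pos_tagger.pos_tag_raw_text, your_text)
# print(f"POS tagging: {avg:.2f} s per call")
```

If the tagger alone accounts for most of the 3 seconds, that confirms where to optimize.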

As you can see in the provided code, we rely on nltk's StanfordPOSTagger implementation for convenience (https://www.nltk.org/_modules/nltk/tag/stanford.html), which is not ideal if you want fast extraction: it starts a new Java process for every call.

To make extraction faster, I suggest creating your own PosTagger that follows our PosTagging interface by implementing the pos_tag_raw_text method, and ensuring that this PosTagger stays loaded and ready for online predictions.
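A minimal sketch of that idea, assuming the interface only requires a pos_tag_raw_text method returning one list of (token, tag) tuples per sentence (the class name and constructor are hypothetical, not part of the project):

```python
class ResidentPosTagger:
    """Hypothetical PosTagging implementation: the backing tagger is
    loaded once in __init__ and stays resident across calls, instead of
    being re-initialized for every document."""

    def __init__(self, tag_fn):
        # tag_fn: any callable mapping a raw string to a list of
        # sentences, each a list of (token, tag) tuples. Load whatever
        # fast tagger you prefer here, once.
        self._tag_fn = tag_fn

    def pos_tag_raw_text(self, text):
        return self._tag_fn(text)
```

The important point is that all loading happens in __init__, so each call to pos_tag_raw_text only pays for the tagging itself.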

One possibility that should be quick to implement and try is a PosTagging class that makes calls to a CoreNLP server. There is a well-maintained Python wrapper for CoreNLP (https://github.com/Lynten/stanford-corenlp) that you can use to build your custom PosTagging class; you will just have to shut down the CoreNLP server once you are done with the keyphrase extraction of all your documents.
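As an illustration of that idea, here is a sketch that skips the wrapper and talks to an already-running CoreNLP server directly over its REST API, using only the standard library (the class and method names are assumptions about the PosTagging interface; the server is assumed to listen on localhost:9000):

```python
import json
import urllib.parse
import urllib.request

class CoreNLPPosTagger:
    """Hypothetical PosTagging implementation backed by a running
    CoreNLP server, so the JVM is started once, outside this process."""

    def __init__(self, url='http://localhost:9000'):
        props = json.dumps({'annotators': 'tokenize,ssplit,pos',
                            'outputFormat': 'json'})
        self._endpoint = url + '/?' + urllib.parse.urlencode(
            {'properties': props})

    def pos_tag_raw_text(self, text):
        # POST the raw text; the server returns tokenized, sentence-split,
        # POS-tagged JSON.
        req = urllib.request.Request(self._endpoint,
                                     data=text.encode('utf-8'))
        with urllib.request.urlopen(req) as resp:
            return self._parse(json.load(resp))

    @staticmethod
    def _parse(doc):
        # One list of (token, POS) tuples per sentence.
        return [[(tok['word'], tok['pos']) for tok in sent['tokens']]
                for sent in doc['sentences']]
```

Because the server keeps its models in memory between requests, each call only pays the HTTP round trip plus tagging time.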

EDIT: With nltk 3.3, they now use a CoreNLP server to mitigate the poor performance of the previous wrapper, so you can try nltk 3.3 and define your own PosTagging class that uses CoreNLPParser, as explained here. I will certainly update the project to nltk 3.3 next week.

Solved with #25


Did you manage to improve the extraction time? On my server, extraction takes 2 minutes with a 16 GB pre-trained model file.