How to speed up the application to 100k documents?
skwolvie opened this issue · 3 comments
Hi,
It works well with one document; however, when I apply this kw_extractor to 100k rows of documents with pandas apply, it takes more than two days to complete. Is there any way to speed up this process?
CODE:
```python
import yake
from nltk.corpus import stopwords

st = set(stopwords.words('japanese'))

def keywords_yake(sample_post):
    # extract keywords for each post & turn them into a text string "sentence"
    simple_kwextractor = yake.KeywordExtractor(n=3,
                                               lan='ja',
                                               dedupLim=0.99,
                                               dedupFunc='seqm',
                                               windowsSize=1,
                                               top=1000,
                                               features=None,
                                               stopwords=st)
    post_keywords = simple_kwextractor.extract_keywords(sample_post)
    # join just the keyword strings, dropping the scores
    return " ".join(word for word, score in post_keywords)

df['keywords'] = df['docs'].apply(keywords_yake)
```
Hi @skwolvie
For that volume I would recommend using Spark NLP's YAKE implementation. You can find more about it here:
https://nlp.johnsnowlabs.com/docs/en/annotators#yake
Since Spark processing can be distributed across cores or a cluster, I think it could be a good fit for your use case.
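Roughly, the pipeline would look like the sketch below. I haven't run this here, so treat it as a starting point: it follows the linked docs, the annotator class is `YakeKeywordExtraction` in recent releases (`YakeModel` in older ones), and the parameter values are placeholders loosely mirroring your yake settings.
```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, YakeKeywordExtraction
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

token = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# YAKE runs row by row inside Spark, so the work is spread across workers
keywords = YakeKeywordExtraction() \
    .setInputCols(["token"]) \
    .setOutputCol("keywords") \
    .setMinNGrams(1) \
    .setMaxNGrams(3) \
    .setNKeywords(1000)

pipeline = Pipeline(stages=[document, sentence, token, keywords])

# convert the pandas DataFrame to a Spark DataFrame with a "text" column
spark_df = spark.createDataFrame(df[['docs']].rename(columns={'docs': 'text'}))
result = pipeline.fit(spark_df).transform(spark_df)
```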
Thanks, that sketch is a helpful kickstart. If Spark NLP turns out not to be an option, though, I would also like to understand why it takes so long with the pandas apply method. It takes less than a second to apply yake to one document, but the total time seems to grow much faster than linearly with the number of documents, and 100k rows is not a huge dataset.
I felt that deduplication is where it takes longer than anything else.
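Your settings are consistent with that: with top=1000 and dedupLim=0.99, the 'seqm' deduplication has a large candidate set to compare for every document, and the snippet also rebuilds the extractor on every call. Per-document time should still be roughly constant, so the total ought to scale about linearly; one way to cut the wall-clock time is to build the extractor once and spread rows across cores with multiprocessing. A minimal sketch, assuming the `df` and `st` from the snippet above (the CSV path is a placeholder):
```python
from multiprocessing import Pool

import pandas as pd
import yake
from nltk.corpus import stopwords

st = set(stopwords.words('japanese'))

# Build the extractor once at module level; each worker process gets its
# own copy instead of re-creating it for every row.
kw_extractor = yake.KeywordExtractor(n=3, lan='ja', dedupLim=0.99,
                                     dedupFunc='seqm', windowsSize=1,
                                     top=1000, stopwords=st)

def keywords_yake(sample_post):
    post_keywords = kw_extractor.extract_keywords(sample_post)
    return " ".join(word for word, score in post_keywords)

if __name__ == "__main__":
    df = pd.read_csv("docs.csv")  # placeholder: load df however you do now
    with Pool() as pool:          # defaults to one worker per CPU core
        df['keywords'] = pool.map(keywords_yake, df['docs'])
```
Lowering top (1000 keywords per document is a lot if they only get joined into a string) or relaxing dedupLim should also shrink the deduplication work directly.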