How to speed up the application to 100k documents?
skwolvie opened this issue · 3 comments
Hi,
It works well with one document; however, when I apply this kw_extractor to 100k rows of documents with pandas apply, it takes more than two days to complete. Is there any way to speed up this process?
CODE:
```python
import yake
from nltk.corpus import stopwords

st = set(stopwords.words('japanese'))

def keywords_yake(sample_post):
    # extract keywords for each post & turn them into a text string "sentence"
    simple_kwextractor = yake.KeywordExtractor(n=3,
                                               lan='ja',
                                               dedupLim=0.99,
                                               dedupFunc='seqm',
                                               windowsSize=1,
                                               top=1000,
                                               features=None,
                                               stopwords=st)
    post_keywords = simple_kwextractor.extract_keywords(sample_post)
    # join just the keyword strings, dropping the scores
    return " ".join(word for word, score in post_keywords)

df['keywords'] = df['docs'].apply(keywords_yake)
```
Hi @skwolvie
For that volume I would recommend using Spark NLP's YAKE implementation. You can find more about it here:
https://nlp.johnsnowlabs.com/docs/en/annotators#yake
Since Spark processing can be distributed across cores or a cluster, I think it could be a good fit for your use case.
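Roughly, the pipeline would look like the sketch below. I haven't run this here, so treat it as a starting point: it follows the linked docs, the annotator class is `YakeKeywordExtraction` in recent releases (`YakeModel` in older ones), and the parameter values are placeholders loosely mirroring your yake settings.
```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, YakeKeywordExtraction
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

token = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# YAKE runs row by row inside Spark, so the work is spread across workers
keywords = YakeKeywordExtraction() \
    .setInputCols(["token"]) \
    .setOutputCol("keywords") \
    .setMinNGrams(1) \
    .setMaxNGrams(3) \
    .setNKeywords(1000)

pipeline = Pipeline(stages=[document, sentence, token, keywords])

# convert the pandas DataFrame to a Spark DataFrame with a "text" column
spark_df = spark.createDataFrame(df[['docs']].rename(columns={'docs': 'text'}))
result = pipeline.fit(spark_df).transform(spark_df)
```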
Thanks, that sketch is a helpful kickstart. If Spark NLP turns out not to be an option, though, I would also like to understand why it takes so long with the pandas apply method. It takes less than a second to apply yake to one document, but the total time seems to grow much faster than linearly with the number of documents, and 100k rows is not a huge dataset.
I felt that deduplication is where it takes longer than anything else.
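Your settings are consistent with that: with top=1000 and dedupLim=0.99, the 'seqm' deduplication has a large candidate set to compare for every document, and the snippet also rebuilds the extractor on every call. Per-document time should still be roughly constant, so the total ought to scale about linearly; one way to cut the wall-clock time is to build the extractor once and spread rows across cores with multiprocessing. A minimal sketch, assuming the `df` and `st` from the snippet above (the CSV path is a placeholder):
```python
from multiprocessing import Pool

import pandas as pd
import yake
from nltk.corpus import stopwords

st = set(stopwords.words('japanese'))

# Build the extractor once at module level; each worker process gets its
# own copy instead of re-creating it for every row.
kw_extractor = yake.KeywordExtractor(n=3, lan='ja', dedupLim=0.99,
                                     dedupFunc='seqm', windowsSize=1,
                                     top=1000, stopwords=st)

def keywords_yake(sample_post):
    post_keywords = kw_extractor.extract_keywords(sample_post)
    return " ".join(word for word, score in post_keywords)

if __name__ == "__main__":
    df = pd.read_csv("docs.csv")  # placeholder: load df however you do now
    with Pool() as pool:          # defaults to one worker per CPU core
        df['keywords'] = pool.map(keywords_yake, df['docs'])
```
Lowering top (1000 keywords per document is a lot if they only get joined into a string) or relaxing dedupLim should also shrink the deduplication work directly.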