ChenghaoMou/text-dedup

Out of memory on Spark

311dada opened this issue · 1 comment

First of all, thanks for your great work!

Recently, I have been deduplicating some Chinese books (~2,000 books, about 10 GB). I use the jieba tokenizer, but Spark throws an out-of-memory error at the groupBy statement. I increased the executor memory to 65 GB, but that did not help. Could you help me figure out where most of the memory is being spent? Thanks!
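
For context, this is roughly the kind of Spark setup involved; a minimal PySpark sketch with illustrative values and a placeholder app name, not the exact settings used here. Raising `spark.sql.shuffle.partitions` is a generic knob for shuffle-heavy stages such as groupBy, not a confirmed fix for this issue.

```python
from pyspark.sql import SparkSession

# Minimal sketch: configuration knobs that commonly relieve memory pressure on
# shuffle-heavy stages such as groupBy. Values are illustrative, not the ones
# actually used in this issue.
spark = (
    SparkSession.builder
    .appName("text-dedup-books")                     # placeholder app name
    .config("spark.executor.memory", "64g")          # per-executor heap
    .config("spark.driver.memory", "16g")            # driver heap
    .config("spark.sql.shuffle.partitions", "2048")  # more, smaller shuffle partitions
    .getOrCreate()
)
```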

Sorry to disturb you. I found that the issue was caused by very long documents. Fixed!
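
For anyone who hits the same error: a minimal PySpark sketch of one way to guard against extremely long documents before the dedup job, by dropping or capping records above a character limit. The input path, column name, and threshold below are placeholders, not taken from text-dedup.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("filter-long-docs").getOrCreate()

MAX_CHARS = 1_000_000  # arbitrary example threshold

# Hypothetical input: one JSON record per line with a "text" column.
df = spark.read.json("books.jsonl")

# Option 1: drop overly long documents entirely.
df_filtered = df.filter(F.length("text") <= MAX_CHARS)

# Option 2: keep every record but cap its length (substring is 1-indexed).
df_capped = df.withColumn("text", F.substring(F.col("text"), 1, MAX_CHARS))

df_filtered.write.mode("overwrite").json("books_filtered")
```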