ChenghaoMou/text-dedup

How to get jaccard score for minhash spark?

rin2401 opened this issue · 4 comments

From the code, I see that threshold is only used to select optimal_param; it is not used to filter out candidate duplicate pairs whose Jaccard score is <= threshold.

You can add a secondary filtering step yourself by calculating the actual Jaccard similarity of each candidate pair. This is not used in most of my or BigCode's experiments because, for our particular datasets, it does not improve the quality and slows down the process a lot. Additionally, we want to remove more data and can afford a certain amount of false positives.
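For anyone who wants to try that second stage, here is a minimal sketch of what such a filter could look like. It is not code from text-dedup: the helper names (`ngrams`, `jaccard`, `filter_pairs`), the whitespace tokenization, and the assumption that candidate pairs arrive as `(id, id)` tuples are all hypothetical and chosen for illustration only.

```python
from typing import Dict, Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Split on whitespace and return the set of word n-grams (shingles)."""
    tokens = text.split()
    if len(tokens) < n:
        # Short documents become a single shingle so they are still comparable.
        return {tuple(tokens)}
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_pairs(
    docs: Dict[int, str],
    candidate_pairs: Iterable[Tuple[int, int]],
    threshold: float,
    n: int = 5,
) -> List[Tuple[int, int]]:
    """Keep only candidate pairs whose exact Jaccard similarity >= threshold."""
    cache: Dict[int, Set[Tuple[str, ...]]] = {}  # shingle each document once
    kept = []
    for i, j in candidate_pairs:
        for k in (i, j):
            if k not in cache:
                cache[k] = ngrams(docs[k], n)
        if jaccard(cache[i], cache[j]) >= threshold:
            kept.append((i, j))
    return kept


docs = {0: "a b c d e f", 1: "a b c d e f", 2: "x y z w v u"}
# Pretend the LSH stage proposed both pairs; (0, 2) is a false positive.
result = filter_pairs(docs, [(0, 1), (0, 2)], threshold=0.7, n=2)
```

The extra cost comes from shingling every candidate document again and doing one set intersection per pair, which is why this step can be slow on large datasets.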

You can refer to this blog post for more details: https://huggingface.co/blog/dedup

Closing this for now, feel free to open another issue if you have any more questions.

@ChenghaoMou Hi, could you please share how much time deduplication takes with the Jaccard similarity check compared to without it?

@nguyenhuuthuat09 Unfortunately, I don't have a number for this, because using Jaccard similarity in the second stage never met our needs when dealing with large datasets. You can find one implementation of it at https://github.com/huggingface/transformers/blob/main/examples/research_projects/codeparrot/scripts/minhash_deduplication.py and test it on your own dataset.