ChenghaoMou/text-dedup

Many duplicate pairs found by minhash_spark.py are not actually similar



When I used minhash_spark.py to process English CC data, I found that many of the duplicate pairs it reported were not actually similar, although genuinely similar pairs were also captured. Why is this? I have tried many combinations of parameters (ngram_size, B, R) and the result is the same. Is there a recommended set of parameters?
In addition, I did not encounter this problem when processing Chinese CC data.

Thanks for the question!

Typically, if you are concerned about false positives, you can add a false-positive check in the clustering stage: compute the true Jaccard similarity between all pairs within each cluster and drop the pairs that fall below your threshold.
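As a rough, self-contained sketch of what such a check could look like (this is not code from this repo; the `ngrams` helper, word-level tokenization, and the 0.7 threshold are illustrative assumptions):

```python
from itertools import combinations


def ngrams(text: str, n: int = 5) -> set:
    """Return the set of word-level n-grams for a document (assumed tokenizer)."""
    tokens = text.split()
    return {" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two n-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_false_positives(cluster: list[str], threshold: float = 0.7):
    """Yield index pairs within a cluster whose true Jaccard similarity
    meets the threshold, discarding LSH false positives."""
    grams = [ngrams(doc) for doc in cluster]
    for i, j in combinations(range(len(cluster)), 2):
        if jaccard(grams[i], grams[j]) >= threshold:
            yield i, j
```

Note this is quadratic in the cluster size, which is exactly the slowdown mentioned below.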

This is not added here because it slows down processing and showed no benefit in the BigCode experiments when the dataset is large.
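For intuition on why no (ngram_size, B, R) setting eliminates false positives entirely (my framing, not a statement from this thread): with B bands of R rows each, a pair with true Jaccard similarity s becomes a candidate with probability 1 - (1 - s^R)^B, so low-similarity pairs always slip through with some small probability, and over millions of comparisons that adds up. A quick illustration, assuming hypothetical values B=25 and R=10:

```python
def candidate_probability(s: float, b: int, r: int) -> float:
    """Standard MinHash LSH banding formula: probability that a pair with
    Jaccard similarity s shares at least one of b bands of r rows each."""
    return 1 - (1 - s**r) ** b


# A pair at only s=0.5 is still flagged ~2.4% of the time with b=25, r=10,
# which across a large corpus produces many false-positive pairs.
print(candidate_probability(0.5, b=25, r=10))
```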