ChenghaoMou/text-dedup

Consistently seeing more rows being dropped in minhash_spark.py compared to minhash.py

prikmm opened this issue · 8 comments

prikmm commented

Hi @ChenghaoMou ,

I have been using minhash_spark.py via GCP Dataproc (with all the BigCode-specific code removed) to deduplicate my multilingual dataset. To get a sense of how reproducible the results are, I also deduplicated the same multilingual dataset using minhash.py.

Currently, deduplication is performed on one individual language at a time.

When I ran this for the first language, I found that minhash.py retained around 15-20% more documents than minhash_spark.py:
the minhash_spark.py output had ~12M documents, while the minhash.py output had ~14.5M documents.

In #28, you mentioned that for the same algorithm, although which documents get removed is random, the number of documents removed stays the same. But I am seeing different behaviour.

To validate this, I ran deduplication over the rest of the language subsets and again found that more documents were dropped by minhash_spark.py.

It would be great if you could help me understand this better by answering a few questions:

  1. Does the connected-components approach used in minhash_spark.py create different clusters than the union-find used in minhash.py? (A toy sketch of the two approaches follows this list.)
  2. If the number of clusters is the same, shouldn't the number of samples in the outputs of both scripts also be the same?
  3. Could running the scripts on different machines be responsible for this behaviour? If yes, what is the reason?
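
For clarity on question 1, here is a toy sketch of what I mean (the edge list is made up and this is not the code from either script): over the same candidate pairs, union-find and connected components should produce identical clusters, so any difference in output size would have to come from the pairs themselves.

```python
import networkx as nx

pairs = [(1, 2), (2, 3), (5, 6)]  # hypothetical candidate duplicate pairs from LSH

# Union-find over the pairs
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in pairs:
    union(a, b)

uf_clusters = {}
for node in parent:
    uf_clusters.setdefault(find(node), set()).add(node)

# Connected components over the same pairs
g = nx.Graph(pairs)
cc_clusters = [set(c) for c in nx.connected_components(g)]

# Same edges -> same clusters
assert sorted(map(sorted, uf_clusters.values())) == sorted(map(sorted, cc_clusters))
```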

Beyond the questions above, I would be grateful for any other info that could help me troubleshoot this behaviour!

Thanks.

Hi @prikmm,

Thanks for opening this issue. One thing that might explain the disparity: num_perm is slightly different in the two scripts (256 vs 250), even though only b*r permutations are actually used in all settings. This creates different PERMUTATIONS and, as a consequence, different results. It should be fixed in the latest commit. Let me know if the issue persists.
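
Roughly speaking, both scripts draw their permutation parameters from a seeded RNG, something like the sketch below (simplified; the constant and seed handling here are illustrative rather than the exact code). Because num_perm controls both the signature length and the (b, r) banding derived from the threshold, 256 vs 250 cannot produce the same candidate pairs:

```python
import numpy as np

MERSENNE_PRIME = np.uint64((1 << 61) - 1)  # common modulus for MinHash permutations

def make_permutations(num_perm: int, seed: int = 42) -> np.ndarray:
    """Draw (a, b) parameters for num_perm universal hashes h(x) = (a*x + b) % p."""
    rng = np.random.RandomState(seed)
    return np.array(
        [
            (
                rng.randint(1, MERSENNE_PRIME, dtype=np.uint64),
                rng.randint(0, MERSENNE_PRIME, dtype=np.uint64),
            )
            for _ in range(num_perm)
        ],
        dtype=np.uint64,
    ).T

# Different num_perm -> different signature length and different banding,
# so the LSH buckets (and the pairs flagged as duplicates) differ between runs.
perm_a = make_permutations(256)  # shape (2, 256)
perm_b = make_permutations(250)  # shape (2, 250)
print(perm_a.shape, perm_b.shape)
```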

prikmm commented

Hi @ChenghaoMou

Thanks for looking into the issue.

Regarding num-perm: I did handle that. I set num-perm to 256 when running both scripts; I forgot to mention this when creating the issue.

In fact, all four parameters are the same:
length: 5
n-gram: 5
threshold: 0.7
num-perm: 256
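
For what it's worth, with the same threshold and num-perm both runs should also end up with the same (b, r) banding. This is roughly how the banding is derived (a datasketch-style sketch, not necessarily the exact code in either script):

```python
from scipy.integrate import quad

def optimal_param(threshold: float, num_perm: int,
                  fp_weight: float = 0.5, fn_weight: float = 0.5):
    """Pick (b, r) with b*r <= num_perm minimizing weighted FP/FN probabilities."""
    def false_positive(b, r):
        # probability a pair below the threshold still collides in some band
        return quad(lambda s: 1 - (1 - s ** r) ** b, 0.0, threshold)[0]

    def false_negative(b, r):
        # probability a pair above the threshold collides in no band
        return quad(lambda s: (1 - s ** r) ** b, threshold, 1.0)[0]

    best, min_error = (0, 0), float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            error = fp_weight * false_positive(b, r) + fn_weight * false_negative(b, r)
            if error < min_error:
                best, min_error = (b, r), error
    return best

print(optimal_param(threshold=0.7, num_perm=256))  # same inputs -> same (b, r) everywhere
```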

I see. Could you provide some example data so I can reproduce the issue? Could you also share the exact commands you use to run the scripts?

Hi @ChenghaoMou , I'm facing the same problem using another local minhash deduplication implementation, which removes significantly fewer documents than the Spark implementation. See huggingface/datatrove#107

@jordane95

Can you share more details? Like the command or the log output?

I took a look at the dataset you shared. The immediate observation is that this particular dataset might not be a good fit for near deduplication, especially when the Q is significantly longer than the answers. Documents would get removed because the Q overshadows the A with its dominant ngrams, even though diverse answers to the same question might not be considered duplicates in reality. You might find it more helpful to do hierarchical deduplication: group similar Qs together, then deduplicate based on the As only within each group.
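
Something along these lines, as a rough sketch (using datasketch purely for illustration; the thresholds, n-gram size, and the greedy grouping are my assumptions, not code from either repo):

```python
from collections import defaultdict
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128, n: int = 5) -> MinHash:
    """Character n-gram MinHash; n and num_perm are illustrative choices."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - n + 1, 1)):
        m.update(text[i : i + n].encode("utf-8"))
    return m

def hierarchical_dedup(pairs, q_threshold: float = 0.7, a_threshold: float = 0.7):
    """pairs: list of (question, answer).
    Stage 1 greedily groups near-duplicate questions; stage 2 deduplicates
    answers only inside each question group."""
    q_lsh = MinHashLSH(threshold=q_threshold, num_perm=128)
    groups = defaultdict(list)  # representative question index -> member indices
    for i, (q, _) in enumerate(pairs):
        mh = minhash(q)
        hits = q_lsh.query(mh)        # previously seen similar questions
        rep = hits[0] if hits else i  # greedy: attach to the first match
        groups[rep].append(i)
        q_lsh.insert(i, mh)

    kept = []
    for members in groups.values():
        a_lsh = MinHashLSH(threshold=a_threshold, num_perm=128)
        for i in members:
            mh = minhash(pairs[i][1])
            if not a_lsh.query(mh):   # no similar answer kept yet in this group
                a_lsh.insert(i, mh)
                kept.append(i)
    return sorted(kept)
```

The greedy grouping is only there to keep the sketch short; in practice you would want a proper clustering pass (union-find or connected components) over the question matches before deduplicating the answers.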

Regarding your question, the text preprocessing before ngram generation is significantly different in the two repos. Different hashing functions can also lead to different behaviour, so I wouldn't expect identical results. In fact, you can run the normal minhash script in this repo, which is based on union-find, and compare its results with the Spark one more directly, though parity between the two implementations is not guaranteed.
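
As a tiny illustration of the preprocessing point (the two normalizations below are made up, not the exact ones used by either repo): the same text can produce completely different shingle sets, and everything downstream (signatures, bands, candidate pairs) follows from those.

```python
import re

text = "Hello,   World! Hello world."

tokens_a = re.sub(r"\W+", " ", text.lower()).split()  # lowercase + strip punctuation
tokens_b = text.split()                                # whitespace split only

def ngrams(tokens, n=3):
    return {" ".join(tokens[i : i + n]) for i in range(max(len(tokens) - n + 1, 1))}

set_a, set_b = ngrams(tokens_a), ngrams(tokens_b)
print(set_a)                                    # {'hello world hello', 'world hello world'}
print(set_b)                                    # {'Hello, World! Hello', 'World! Hello world.'}
print(len(set_a & set_b) / len(set_a | set_b))  # 0.0 -- different shingles, different signatures
```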

I have tried another implementation from StarCoder, which produces nearly the same deduplication rate as the datatrove implementation. Is there any reason why bigcode didn't use the GraphFrames implementation in this code but instead re-implemented it using self-defined functions?

We moved away from union-find to a Spark implementation and then to GraphFrames. GraphFrames is used in the latest V2 (to be released): https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/bigcode-v2/intra_dedup.py
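
For reference, the GraphFrames route looks roughly like this (a minimal sketch with made-up ids; the column names and checkpoint requirement are standard GraphFrames conventions, and it assumes the graphframes package is available on the cluster — this is not the V2 code itself):

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("cc-sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")  # required by connectedComponents

# Candidate duplicate pairs produced by LSH banding (hypothetical document ids).
edges = spark.createDataFrame([(1, 2), (2, 3), (5, 6)], ["src", "dst"])
vertices = edges.selectExpr("src as id").union(edges.selectExpr("dst as id")).distinct()

# Each connected component is one duplicate cluster; keep one document per component.
components = GraphFrame(vertices, edges).connectedComponents()
components.show()
```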

Without details, I can't offer much help.
