ChenghaoMou/text-dedup

Load data from several workers with minhash_pyspark.py

nguyenhuuthuat09 opened this issue · 7 comments

Hi, I'm setting up a local Spark cluster, but I have a problem: my data is too large to fit on a single machine in the cluster for PySpark to load and process.

My data consists of several small subsets, and each subset is small enough to fit on a single machine. I wonder whether PySpark can load the different subsets stored on different worker machines and then collect and process that data together?

I am also considering another option: deduplicate each part of the data separately, then merge the parts and deduplicate again. Since we use MinHash, I think this would probably still produce a result equivalent to deduplicating all the data at once. What do you think about this option?

Thank you!

This repo

Theoretically, I think it would be possible.
You would reuse the uf (union-find) data structure that gets pickled when the debug argument is used: reload it and then continue as normal (a rough sketch follows below).
This usage is a bit brittle, but I think the developer of this repo is planning a more streamlined approach for this. I think something like this is what they mean by "Inter-dataset deduplication" in the README.
You will have to be very careful to preserve all settings, including shingling, the ngram splitter/tokenizer choice, the hash_function choice, and the other parameters. Literally everything except the input data itself has to be EXACTLY the same. Avoid relying on defaults and configure as much as possible manually.
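A minimal sketch of that workflow, with heavy caveats: the UnionFind class below is my own stand-in for whatever object the debug run actually pickles, and the file name, output location, and union(x, y) interface are all assumptions that depend on the version you ran.

```python
import os
import pickle

class UnionFind:
    """Tiny stand-in for the repo's pickled union-find (an assumption, not the real class)."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

UF_PATH = "uf.pkl"  # hypothetical dump from a previous debug run

# Reload the clusters built from the previous dataset if they exist, otherwise start fresh.
if os.path.exists(UF_PATH):
    with open(UF_PATH, "rb") as f:
        uf = pickle.load(f)
else:
    uf = UnionFind()

# ... generate MinHash candidate pairs for the NEW dataset using the exact same
# ngram size, tokenizer, hash function, permutations, and band/row settings ...
for left_id, right_id in [(101, 7), (102, 101)]:  # placeholder document ids
    uf.union(left_id, right_id)

with open(UF_PATH, "wb") as f:
    pickle.dump(uf, f)  # persist the merged clusters for the next round
```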

Alternative

The datasketch repo has much better support for this, and its maintainers take particular care to maintain compatibility across versions. It primarily provides primitives, so using it is a bit more work, but many people do so successfully.
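A rough sketch of what that could look like with datasketch (MinHash and MinHashLSH are its actual classes; the threshold, num_perm, and whitespace tokenization here are illustrative choices that would have to stay identical across runs, and you could keep or persist the index between subsets):

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # must stay identical across every run

def minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for token in text.lower().split():  # tokenizer choice is illustrative
        m.update(token.encode("utf-8"))
    return m

# Index the first subset.
lsh = MinHashLSH(threshold=0.5, num_perm=NUM_PERM)
for i, text in enumerate(["the quick brown fox jumps", "lorem ipsum dolor sit"]):
    lsh.insert(f"subset0-{i}", minhash(text))

# Stream a second subset against the index: a non-empty query result means the
# document is a near-duplicate of something already indexed; otherwise index it too.
for i, text in enumerate(["the quick brown fox jumps over", "completely new text"]):
    matches = lsh.query(minhash(text))
    if matches:
        print(f"subset1-{i} duplicates {matches}")
    else:
        lsh.insert(f"subset1-{i}", minhash(text))
```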

@ChenghaoMou if you are planning to implement this and have a clear vision as to how, feel free to share. I'll help you build it.

As you may know, most of what I've been doing in this commit has been trying to avoid having to resort to approaches like this, but it may not be up to me.

Hi, thank you for your reply!

I think I found a way to solve this problem: use an additional Hadoop (HDFS) cluster to store the data for the Spark cluster. This way I have enough storage for the whole dataset.
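A minimal sketch of this setup (the namenode host/port, paths, and parquet format below are placeholders): the data lives on HDFS and each Spark executor reads only the splits it processes, so no single machine ever needs to hold the full dataset.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("minhash-dedup").getOrCreate()

# Every worker pulls its own partitions straight from HDFS.
df = spark.read.parquet("hdfs://namenode:8020/datasets/my_corpus")

# ... run the MinHash deduplication on `df` ...

# Write the deduplicated output back to HDFS as well.
df.write.mode("overwrite").parquet("hdfs://namenode:8020/datasets/my_corpus_dedup")
```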

To be honest, I don't have much experience with PySpark, so it took me quite a while to figure this out.

But I think the inter-dataset deduplication feature would be very useful, because not everyone has powerful enough machines. Also, when new data arrives, it would be possible to simply merge it with the already-deduplicated old data and remove the duplicates.

I strongly agree. There is other software (possibly including private/proprietary tools?) that works well for this purpose in PySpark, but it might as well be black magic to me.

Hi, I'm setting up a local Spark cluster, but I have a problem: my data is too large to fit on a single machine in the cluster for PySpark to load and process.

Originally, all the scripts were optimized for speed, assuming enough computing resources. That being said, there have always been heuristics to narrow down the scope, such as splitting by language. Make sure you have some evidence that you actually need to deduplicate at the global level.
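For instance, if the corpus already carries a language (or source/domain) column, you can deduplicate each slice independently instead of everything at once. A hedged sketch, assuming a language column and a placeholder HDFS path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scoped-dedup").getOrCreate()
df = spark.read.parquet("hdfs://namenode:8020/datasets/my_corpus")  # placeholder path

# Deduplicate one language slice at a time; each slice is far smaller than the whole corpus.
languages = [row["language"] for row in df.select("language").distinct().collect()]
for lang in languages:
    subset = df.filter(df["language"] == lang)
    # ... run the MinHash pipeline on `subset` ...
```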

My data consists of several small subsets, and each subset is small enough to fit on a single machine. I wonder whether PySpark can load the different subsets stored on different worker machines and then collect and process that data together?

I am also considering another option: deduplicate each part of the data separately, then merge the parts and deduplicate again. Since we use MinHash, I think this would probably still produce a result equivalent to deduplicating all the data at once. What do you think about this option?

It is possible to redesign all the scripts for memory constraints, at the cost of longer processing time. For example:

|          | Subset 0 | Subset 1   | Subset 2   |
|----------|----------|------------|------------|
| Subset 0 | self dup | across dup | across dup |
| Subset 1 | x        | self dup   | across dup |
| Subset 2 | x        | x          | self dup   |

After processing all pairs, merge all the duplicate ids to get your final results. This is already implemented in https://github.com/google-research/deduplicate-text-datasets, which is mentioned in #38 and #42 here, and it was once implemented in this repo as well. It was later removed for lack of interest.
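A rough sketch of that pairwise scheme (my own illustration, not code from this repo or from deduplicate-text-datasets): each subset is checked against itself, every later subset is checked against every earlier one, and the ids to drop are merged at the end. Exact hashing stands in for the MinHash comparison just to keep the example short and runnable.

```python
from hashlib import md5
from itertools import combinations

def doc_hash(text: str) -> str:
    return md5(text.encode("utf-8")).hexdigest()

def self_duplicates(docs: list[str]) -> set[int]:
    """Ids of later copies within a single subset (the 'self dup' diagonal)."""
    seen, dups = set(), set()
    for i, text in enumerate(docs):
        h = doc_hash(text)
        if h in seen:
            dups.add(i)
        seen.add(h)
    return dups

def across_duplicates(later: list[str], earlier: list[str]) -> set[int]:
    """Ids in `later` that already appear in `earlier` (the 'across dup' cells)."""
    earlier_hashes = {doc_hash(t) for t in earlier}
    return {i for i, text in enumerate(later) if doc_hash(text) in earlier_hashes}

def pairwise_dedup(subsets: dict[str, list[str]]) -> dict[str, set[int]]:
    to_drop = {name: self_duplicates(docs) for name, docs in subsets.items()}
    for (name_a, a), (name_b, b) in combinations(subsets.items(), 2):
        to_drop[name_b] |= across_duplicates(b, a)  # drop from the later subset
    return to_drop

# Only two subsets ever need to be compared at the same time.
subsets = {"subset_0": ["foo", "bar", "foo"], "subset_1": ["bar", "baz"]}
print(pairwise_dedup(subsets))  # {'subset_0': {2}, 'subset_1': {0}}
```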

However, iteratively deduplicating and merging subsets is not equivalent, because you can end up with different results:

```
A = [am, b]
B = [amb, q]
C = [mb, t]

E = dedup(A + B) = [am, b, q]          # similarity(am, amb) >= threshold
F = dedup(E + C) = [am, mb, b, q, t]   # similarity(am, mb) < threshold

full_F = dedup(A + B + C) = [am, b, q, t]  # similarity(amb, mb) >= threshold
```
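To make this concrete, here is a tiny self-contained demonstration (not the repo's actual code): exact character-set Jaccard stands in for MinHash, duplicates are grouped with a small union-find over all similar pairs (mirroring the clustering step), and the 0.6 threshold is an arbitrary choice that reproduces the relationships above.

```python
from itertools import combinations

THRESHOLD = 0.6  # illustrative

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def dedup(docs: list[str]) -> list[str]:
    """Union every pair above the threshold, then keep one document per cluster."""
    parent = {d: d for d in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if jaccard(a, b) >= THRESHOLD:
            parent[find(a)] = find(b)

    seen, kept = set(), []
    for d in docs:  # the first document of each cluster survives
        root = find(d)
        if root not in seen:
            seen.add(root)
            kept.append(d)
    return kept

A, B, C = ["am", "b"], ["amb", "q"], ["mb", "t"]

E = dedup(A + B)           # ['am', 'b', 'q']            -- amb lands in am's cluster
F = dedup(E + C)           # ['am', 'b', 'q', 'mb', 't'] -- mb survives: the bridge doc amb is gone
full_F = dedup(A + B + C)  # ['am', 'b', 'q', 't']       -- am, amb, mb form one cluster

print(F)       # iterative result
print(full_F)  # one-shot result: different
```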

You are right. I missed the case you described. I'm going to close this issue.