ChenghaoMou/text-dedup

Run MinHash dedup on multiple nodes


Hello there,

First, I would like to thank you for putting together this amazing repo!

I'm currently working on a project aiming to release the largest clean Arabic text dataset. We now have about 313 billion tokens (Qwen tokenizer) gathered from multiple clean sources.
The quality work (toxicity filtering, URL cleaning, short-sentence removal, ...) is already solid, so the major step we still have to perform is dedup, and for that we chose your tool, which is both clean and straightforward!
One problem arises from the large size of the dataset: we cannot perform dedup on a single node (52 AMD CPU cores) in a reasonable time. We therefore want to explore running the script in a multi-node setting: 52 CPU cores per node, on about 50 nodes.

Do you think we can do that with what is currently available in this repo? If yes, I would appreciate your guidance on this matter.

Thank you again, and looking forward to hearing from you soon!

Hi @alielfilali01

Thanks for reaching out.

313 billion tokens sounds doable with a decent cluster, based on my experience. For reference, the Spark script was tested on a TB-level dataset with fewer than 20 nodes. Can I ask whether your hardware is on a commercial cloud platform or a local HPC? I am more than happy to jump on a call to discuss more details if you want.

Best,
Chenghao

Hi dear @ChenghaoMou

Thank you so much for your quick response and your openness 🤗

Personally, I know very little about text dedup, so all I did was copy-paste the command from the main README, pointing it at the directory of my dataset... I have no idea how I could do it using Spark!
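
For reference, the single-node command I ran looked roughly like the README example below. I'm reproducing it from memory, so the dataset path, parameter values, and exact flag names are illustrative and may differ from the current README:

```bash
# Single-node MinHash dedup, in the style of the text-dedup README.
# The path and parameter values below are placeholders for my setup.
python -m text_dedup.minhash \
  --path "/data/arabic_corpus" \
  --split "train" \
  --cache_dir "./cache" \
  --output "./output/minhash_dedup" \
  --column "text" \
  --batch_size 10000
```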
As for the hardware, it is a local HPC located on the UM6P campus in Morocco.

If you could walk me through the process of running MinHash dedup using Spark in a multi-node setting here, that would be even better! This way I won't take too much of your time. Otherwise, if you believe a call would be better, I would be happy and honored to do it.

Thank you so much again 🤗

Thanks for the details. In this case, you might have at least two options:

  1. Try datatrove with its Slurm pipeline executor, which handles deduplication with minimal HPC configuration and knowledge.
  2. Set up Spark on your HPC (see the Spark cluster documentation, and make sure you double-check the Spark script, because it may use settings different from the normal script). You should treat setting up the Spark cluster on the HPC and running the script (spark-submit) as separate steps; the latter is only one command once the cluster is ready. A rough sketch of the submission step follows this list.
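
For option 2, once the cluster is up, the submission step looks roughly like the sketch below. This is a minimal example assuming a standalone Spark cluster; the master URL, memory/core sizes, and the script's argument names are assumptions on my part, so verify them against your cluster setup and the Spark script's actual `--help` output:

```bash
# Sketch of submitting the MinHash Spark script to a standalone cluster.
# Master host, resource settings, paths, and script flags are illustrative
# assumptions -- check them against your cluster and the script itself.
spark-submit \
  --master spark://<master-host>:7077 \
  --driver-memory 16g \
  --executor-memory 48g \
  --executor-cores 8 \
  text_dedup/minhash_spark.py \
  --input "/data/arabic_corpus" \
  --output "/data/arabic_corpus_dedup" \
  --column "text" \
  --threshold 0.7
```

With ~50 nodes of 52 cores each, most of the work is in sizing the executors and making sure the input lives on storage every node can read (e.g., a shared filesystem). For option 1, datatrove's Slurm pipeline executor plays the same role, but you configure the pipeline in a Python script rather than a single shell command.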

There is some learning and tweaking involved in either method, but it is a one-time investment that you will find useful for future experiments. I suggest starting with a small cluster and a small sample of the data to make sure everything runs before scaling up. Unfortunately, for an HPC with restricted access, I won't be able to help directly, but feel free to post any issues or questions and I will do my best to answer them.

Best,
Chenghao

Thank you so much, dear @ChenghaoMou!
I'll get back to you if I have any more questions.
