ChenghaoMou/text-dedup

Doubts about the multiprocessing start method and behavior

LoveCatc opened this issue · 9 comments

First of all, thanks to the developers for their time in creating such a useful tool.
I am trying to modify text_dedup/minhash.py and use it to deduplicate a ~800GB dataset, with each document varying from 5GB to 50GB in size. The detailed running environment is as follows.

CPU: 128-cores
Memory: 2TB
OS: Ubuntu 20.04.5 LTS
Kernel: 3.10.0-1160.el7.x86_64
User: root
Python version: 3.8.10

I run the code with the following command (I modified the data-loading code so that it can handle JSON files directly):

python -m text_dedup.custom_minhash --path /path/to/src --output /path/to/dst --column text --ngram 7 --min_length 7 --num_perm 64 --threshold 0.85 --local

I found that when the code reached certain steps, such as fingerprinting, only 1 process seemed to be working: I could see only 1 core at 100% load at any given time, and the step took an extremely long time, about 3 hours. I searched the multiprocessing documentation and other related sources, but I cannot figure out why. I also tried changing the start method via multiprocessing.set_start_method, but that did not work either.
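
For reference, what I tried looks roughly like the sketch below; it is a minimal, simplified version rather than my actual script, and embed_func and the file path are just placeholders for the real fingerprinting function and data:

import multiprocessing
from datasets import load_dataset

def embed_func(example):
    # placeholder for the real fingerprinting function in minhash.py
    example["text"] = example["text"].lower()
    return example

if __name__ == "__main__":
    # tried "fork", "spawn", and "forkserver" here
    multiprocessing.set_start_method("spawn", force=True)
    ds = load_dataset("json", data_files="/path/to/src/*.jsonl")["train"]
    # num_proc is set, but only one core shows 100% load during this step
    ds = ds.map(embed_func, num_proc=120)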

So I am wondering: is this expected, or is something going wrong? Also, why did you choose fork as the multiprocessing start method? Was there any special consideration?

Appreciate your time and patience!

While the code is running, I use htop to inspect the process status, and it looks like this:
[htop screenshot]

It remains like this until the step finishes. No child process ever shows the R (running) state.

Something is definitely wrong here. Could you share your modified code?

If the map function is not parallelising properly, it might be a datasets issue.
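
One quick sanity check, independent of this repo's code, is to see whether datasets' map parallelises at all on your machine; something along these lines:

import os
from datasets import Dataset

def tag_pid(example):
    # record which worker process handled each example
    example["pid"] = os.getpid()
    return example

ds = Dataset.from_dict({"text": ["hello world"] * 100_000})
ds = ds.map(tag_pid, num_proc=8)
print(len(set(ds["pid"])))  # should be close to 8 if map is parallelising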

Also, just to make sure I am reading your message correctly, are you using the script to dedup fewer than 160 files? That file size seems too big to be suitable for minhash.

Thanks for replying! The modifications are mainly about dataset loading and saving. The other modified parts are only executed when --debug is specified; since I am not using --debug here and saving is functioning properly, I will only explain the loading part. It looks like this:

from datasets import load_dataset

if args.local:
    # ds = load_from_disk(args.path)
    # get_all_files returns (paths, n_files); [0] gives the list of file paths
    ds = load_dataset(
        "json",
        data_files=get_all_files(
            args.path,
            legal_file_type=('.json', '.jsonl'),
            need_abspath=True,
        )[0],
    )
# (other loading branches are unchanged and omitted here)
ds = ds['train']

I modified the loading behaviour for local data. Here get_all_files returns a tuple (paths: List[str], n_files: int), where paths contains the file paths of the JSON files. I am not sure whether this could cause trouble, but I have seen similar usage in the official documentation on loading local JSON files. I will search the documentation and community to see whether there are similar problems with datasets.
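
For completeness, get_all_files is just a small helper of mine; it behaves roughly like this simplified sketch (not my exact implementation):

import os
from typing import List, Tuple

def get_all_files(
    path: str,
    legal_file_type: Tuple[str, ...] = (".json", ".jsonl"),
    need_abspath: bool = True,
) -> Tuple[List[str], int]:
    # walk the directory and collect every file with a matching extension
    paths = []
    for root, _, files in os.walk(path):
        for name in files:
            if name.endswith(legal_file_type):
                full = os.path.join(root, name)
                paths.append(os.path.abspath(full) if need_abspath else full)
    return paths, len(paths)
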
About the other thing you mentioned, sorry for the confusion. Originally the files really were that big, but I have since run some preprocessing, so now each file is 50~100MB and the ~800GB dataset consists of ~9900 files. I am not sure whether this file size is suitable for minhash. If you could spare the time, could you please explain why file size affects the use of minhash, or point me to some supplemental materials to learn from? I thought the program would always read the data from all the files and then process them together as ONE dataset, and that the only problem with overly large files might be slow reading speed.

I retried the 3 multiprocessing start methods this morning, and I found that when I use spawn as the start method, I can finally see TWO processes running together. But it is still strange, because:

  1. I manually specify n_proc=120, but ONLY 2 processes are working. There are still other processes, just like in the screenshot above, which never run and consume exactly the same VIRTUAL memory as the running ones. The consumed memory is also considerably less than with fork: now only ~100GB of RAM is used, whereas fork used ~400GB as I remember.
  2. The TWO running processes do not double the throughput; in fact both of them sit at only ~50% CPU load, so in total they work at roughly the same speed as, or even slower than, the fork case (one process running at 100% load), judging by the iteration speed (~30000 examples/s) reported by tqdm.
  3. I further sorted the related processes by TIME+ and found that the running processes are actually the earliest one (the main process, i.e. the one that runs under fork) and the latest one (shortest TIME+ and biggest PID).

I don't know whether this information helps. Feel free to let me know if anything further is needed.

Hey, I debugged the code and finally found the cause. I had modified the ngram part in embed_func but completely forgot about it. I have now adjusted the ngram code so that it works properly with multiprocessing, and the program runs very well. Sorry to bother you, and many thanks for everything you've done!
Still, I would like to ask about the file size issue: I cannot work out the exact impact of overly large files. I am also curious about your reasons for choosing fork as the multiprocessing start method. In any case, I appreciate your time and patience.

No worries. Glad you fixed the problem. Here is my reasoning about large files, assuming it is natural language text data: minhash approximates Jaccard similarity between sets of ngrams. If two files are sufficiently large, their ngram/vocabulary differences become too small for the approximation to discern. For example, if you compare DNA sequences with bigrams, any pair of DNA sequences will likely be clustered as similar, in which case you need a relatively large ngram size. If you are trying to deduplicate across different datasets, there is another way to do it.
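
To make that concrete, here is a toy illustration using the datasketch library and character ngrams (not the exact tokenisation this repo uses): the estimate approximates the Jaccard similarity of the two ngram sets, and the larger the documents get, the more those sets overlap, so the estimate stops distinguishing them.

from datasketch import MinHash

def char_ngrams(text, n=7):
    # set of character ngrams, standing in for whatever tokenisation you use
    return {text[i : i + n] for i in range(len(text) - n + 1)}

def minhash_of(text, n=7, num_perm=64):
    m = MinHash(num_perm=num_perm)
    for gram in char_ngrams(text, n):
        m.update(gram.encode("utf-8"))
    return m

a = minhash_of("the quick brown fox jumps over the lazy dog")
b = minhash_of("the quick brown fox jumped over the lazy dogs")
print(a.jaccard(b))  # estimate of |A ∩ B| / |A ∪ B| for the two ngram sets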

Thanks for replying! I must apologize for the ambiguity. Each .json file actually contains many JSON lines. For example, one file may contain 10k pieces of text, each of which is a few hundred characters long. In that case I think minhash is an excellent fit, in line with the benchmark you've given.

That makes a lot of sense then.

As for the fork setting you asked about, it was originally used to reduce memory usage on macOS. Fork here means that the shared object uf is not copied into child processes as long as it is not modified (it behaves like a read-only reference). The surrounding code that stops the garbage collector is there for the same reason. This behaviour also depends on the system implementation; Windows, for example, might behave very differently. Additionally, memory usage reported by top or htop might not reflect the real usage.
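
As a rough illustration of that idea (not the actual code in this repo), with the fork start method a large object built in the parent is inherited by the children via copy-on-write rather than being pickled and duplicated:

import gc
import multiprocessing as mp

# stands in for the shared union-find structure built in the parent
uf = {i: i for i in range(1_000_000)}

def worker(idx):
    # read-only access: the child sees the parent's memory pages via
    # copy-on-write, so uf is not physically duplicated
    return uf[idx]

if __name__ == "__main__":
    gc.freeze()   # move existing objects to the permanent generation
    gc.disable()  # keep the collector from touching (and dirtying) those pages
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(worker, [0, 1, 2, 3]))
    gc.enable()

Any write to the shared object, or even the garbage collector updating object headers, can dirty those pages and trigger real copies, which is why the collector is paused around that section.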

In short, fork should work reasonably well on macOS and Unix systems, but feel free to test different settings if they work better on your system.

Closing it for now. Feel free to open another one if you have more issues.