ChenghaoMou/text-dedup

Is that minhash oom error normal ?

Closed this issue · 3 comments

I get this error when using the minhash script here.

Iterating MinHashes...: 17%|█▋ | 342/1982 [2:04:34<4:07:05, 9.04s/it]

python -m text_dedup.minhash --batch_size 10000 --column "text" --num_perm 9000 --b 450 --r 20

error: Detected 1 oom-kill event(s) in step 2008.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

I have fed in a 65 GB Chinese JSONL file with 19,816,925 documents, on a machine with 700 GB of memory and 70 CPU cores, but still get that error.

When the file size doubles to 130 GB, the error appears at Iterating MinHashes...: 9%, and lowering batch_size to 5000 does not change where that 9% failure happens.

According to this calculation, 700GB of memory can only handle about 10GB of data. I wonder if this is normal?

num_perm 9000 means 9000 integers for each document, so you would need at least 9000 × 64 bits × 19,816,925 documents ≈ 1.4 TB of memory just for the signatures.
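A quick sanity check of that estimate (a minimal sketch in Python; the real footprint will be higher once the dataset itself and any other bookkeeping are counted):

# Lower bound on MinHash signature memory for the 65 GB run.
num_docs = 19_816_925    # documents in the JSONL file
num_perm = 9_000         # value passed to --num_perm
bytes_per_hash = 8       # 64-bit hashes (the default)

signature_bytes = num_docs * num_perm * bytes_per_hash
print(f"{signature_bytes / 1e12:.2f} TB")  # ~1.43 TB, well above 700 GB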

I would guess this number comes from the Google paper, but they ran that on their in-house cluster implemented in Spark. You might want to adjust the settings for your hardware.

Most published research uses a permutation count of 1000 or lower. Do you really need to go above that? Since that research is mostly for Western languages, you may need somewhat more.
Also, the default ngram shingling here is not optimal for languages without space-separated words, so you might need to write your own. (For Chinese, a morpheme analyser or a Chinese word tokenizer is good enough.)
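If it does come to writing your own, a word-level shingling function could look roughly like this (a sketch only: jieba is an external Chinese tokenizer used here purely as an example, not something this repo bundles, and you would still have to wire it into the script's shingling step):

import jieba  # third-party Chinese word tokenizer, example only

def chinese_word_ngrams(text: str, n: int = 3) -> set[str]:
    # Segment into words instead of splitting on whitespace,
    # then build n-gram shingles over the word sequence.
    tokens = [t for t in jieba.lcut(text) if t.strip()]
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

print(chinese_word_ngrams("今天天气很好，我们一起去公园散步。"))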

32-bit and 16-bit hashes have also been implemented; you can enable them via the hash_bits argument.

In summary:
First, try lower permutation counts (start with 1000, go down to 256).
Second, try lower hash bits (start with 64, the default, then 32, then go down to 16).
If both of these fail, please let us know. If it is still a genuine OOM issue at that point, you might need to incorporate a Chinese tokenizer.
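For example, a more conservative starting point might look something like this (assuming the 32-bit option is exposed as a --hash_bits flag, as mentioned above; the b/r split is just one choice whose product matches num_perm):

python -m text_dedup.minhash --batch_size 10000 --column "text" --num_perm 1000 --b 50 --r 20 --hash_bits 32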

In the end, I ran it successfully on this 65 GB Chinese file with the parameters "--num_perm 800 --b 400 --r 20" (another parameter setting recommended in the Google paper), without a Chinese word tokenizer, and got good results; leaving this here as a reference for latecomers. Thank you both.
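For reference, the full invocation was presumably the same command as above with only the MinHash parameters changed:

python -m text_dedup.minhash --batch_size 10000 --column "text" --num_perm 800 --b 400 --r 20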