ChenghaoMou/text-dedup

Can the suffix_array here be used for Chinese encoded in UTF-8?

BillZid opened this issue · 11 comments

https://github.com/google-research/deduplicate-text-datasets states: "Second, we don't want UTF8 strings. Everything is a [u8] byte array, because we might be working over token sequences which aren't valid UTF8."

However, in my usage, no matter how I adjust the value of k (30, 100, 150, 1200), many Chinese documents are reduced to very short lengths and completely lose their semantic meaning. I'm not sure whether this is related to the statement quoted above.
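
To make the encoding question concrete, here is a minimal Python sketch (my own, not from either repo); it assumes the threshold k counts bytes of the UTF-8-encoded text, as the [u8] byte-array quote above suggests, in which case a given k corresponds to roughly k/3 Chinese characters:

```python
# Minimal sketch (not from the repo), assuming k is a byte-length threshold
# over the UTF-8-encoded text, as the [u8] byte-array quote above suggests.
text = "今天天气很好,我们一起去公园散步。"
data = text.encode("utf-8")          # the byte array a suffix array would be built over

print(len(text))                     # number of characters
print(len(data))                     # number of bytes; CJK characters take 3 bytes each in UTF-8
print(data.decode("utf-8") == text)  # True: byte-level processing does not lose UTF-8 content
```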

Does this happen when using the Google code or this one?

Can you provide the logs? It should show you how many duplicates it found:

Duplicates found: 82

One thing worth mentioning is that you probably need to make sure all the cache files are deleted between each run.

> Can you provide the logs? It should show you how many duplicates it found:
>
> Duplicates found: 82

My logs: Duplicates found: 18074873, with INFO: Before 310137358 bytes (86995) and After 149088578 bytes (69328). All the cache files were deleted between each run.

> Does this happen when using the Google code or this one?

This one. I didn't try the Google code because my data is UTF-8 encoded ('encoding=utf-8').

Can you provide logs when you change k? Could you also provide a small example to reproduce your issue?

Please note that "we don't want UTF8 strings" means the code is encoding-agnostic, not that it rejects UTF-8 data. This repo is only a wrapper around Google's code.

k means >= k: if your duplicate substrings are already very long, they will always be removed when you choose a small k.

Duplicate substrings are removed following the advice given in their repo; according to their paper, doing this has little to no impact on downstream tasks.
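
To make the >= k point concrete, here is a toy sketch (a naive illustration of the idea only, not the suffix-array implementation this repo wraps): a long duplicated passage exceeds any small threshold, so a small k removes it just as a moderate k would, and raising k only spares the shorter duplicates.

```python
# Toy illustration only (not the actual suffix-array code): with a threshold k
# meaning "duplicated substrings of length >= k bytes are removed", a long
# duplicated passage is caught at every k up to its own length.
def longest_repeated_span(data: bytes) -> int:
    """Naive length of the longest substring that occurs at least twice."""
    best = 0
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            length = 0
            while j + length < len(data) and data[i + length] == data[j + length]:
                length += 1
            best = max(best, length)
    return best

passage = ("这是一个重复出现的段落。" * 5).encode("utf-8")  # heavily duplicated Chinese text
longest = longest_repeated_span(passage)

for k in (30, 100, 150, 1200):
    print(f"k={k:>4}: longest repeated span = {longest} bytes -> removed? {longest >= k}")
```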

I tested with 86995 documents; the results with different k (Before = 310137358 bytes, 86995 documents):

| k | Duplicates found | After |
| --- | --- | --- |
| 30 | 116566212 | 161268825 bytes (72741) |
| 100 | 22099748 | 156786700 bytes (70255) |
| 150 | 18074873 | 149088578 bytes (69328) |
| 300 | 11407956 | 131281788 bytes (66776) |
| 1200 | 2737692 | 88521594 bytes (52953) |

Also, I have a question: why do the "After" bytes keep decreasing as k increases, even though fewer duplicates are found?

submit.txt

I'm attaching a sample of 5000 documents for testing; just change the '.txt' extension to '.jsonl'. Thanks!
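
Purely as a convenience (the field names are whatever the original JSONL uses; nothing here is specific to this repo), renaming the attachment back and peeking at the first record might look like:

```python
# Illustrative only: GitHub attachments must end in .txt, so copy the file
# back to .jsonl and inspect the first record's fields.
import json
import shutil

shutil.copy("submit.txt", "submit.jsonl")
with open("submit.jsonl", encoding="utf-8") as f:
    first = json.loads(next(f))
print(list(first))  # the keys of the first JSON record
```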

| k | Duplicates found | Before | After |
| --- | --- | --- | --- |
| 30 | 1048053 | 18567658 bytes (5000) | 16829693 bytes (4979) |
| 100 | 541777 | 18567658 bytes (5000) | 17742479 bytes (4985) |
| 150 | 427173 | 18567658 bytes (5000) | 17877999 bytes (4987) |
| 300 | 267233 | 18567658 bytes (5000) | 18110347 bytes (4995) |
| 1200 | 78140 | 18567658 bytes (5000) | 18412785 bytes (4996) |

I cannot reproduce the issue with the data you provided. As you can see above, the amount of duplicate data removed decreases as k is increased.

Cache files can exist in the following places (see the cleanup sketch after the list):

  1. the cache directory you specified on the command line;
  2. any output or temporary files generated in Google's code directory.
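
For example, a rough cleanup sketch with placeholder paths (both paths are assumptions; substitute the cache directory you actually passed on the command line and wherever your clone of Google's code writes its temporary output):

```python
# Rough sketch with placeholder paths (adjust to your own setup): remove the
# cache directory passed on the command line and any leftover temporary output
# inside your clone of Google's deduplicate-text-datasets repo before re-running.
import shutil
from pathlib import Path

CACHE_DIR = Path("cache")                            # placeholder: your cache directory
GOOGLE_TMP = Path("deduplicate-text-datasets/tmp")   # placeholder: temp output in the Google clone

for path in (CACHE_DIR, GOOGLE_TMP):
    if path.exists():
        shutil.rmtree(path)
        print(f"removed {path}")
```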

Thank you for your quick reply and testing! I had overlooked the cache files in Google's code directory; after deleting them, my task succeeded. This GitHub repository is really helpful for me.

Also remember to call ds.cleanup_cache_files(). I am going to add this to all the scripts soon (@ChenghaoMou will still need to look it over and approve the PRs, though). I am also improving all the dedup scripts at the moment. The suffix array one is the most unfamiliar to me, so it will take a while, but when I get around to it, I shall ping you here.
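
A minimal sketch of what that call looks like with the Hugging Face datasets library (the file name and the "text" field are placeholders, not the repo's actual pipeline):

```python
# Minimal sketch (file name and "text" field are placeholders): after a
# map step, delete the cache files the dataset wrote so the next run is clean.
from datasets import load_dataset

ds = load_dataset("json", data_files="data.jsonl", split="train")
ds = ds.map(lambda ex: {"n_bytes": len(ex["text"].encode("utf-8"))})

removed = ds.cleanup_cache_files()  # returns the number of cache files deleted
print(f"removed {removed} cache file(s)")
```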


Thank you for the reminder and your attention to this issue!