This package removes duplicate texts in a memory-friendly way.
It first encodes each text as a SHA-1 hash code, and then stores the text in a shard file chosen by a prefix of that code.
For example, the SHA-1 hash code of the string `Say hello` is `979e25dd9941e53784ce56e98842f95b8f7fd026` in hexadecimal.
from hashlib import sha1
sha1("Say hello".encode("utf-8")).hexdigest()
# '979e25dd9941e53784ce56e98842f95b8f7fd026'
Therefore, when the prefix length is set to 4, the string `Say hello` is saved to `path/to/shard/97/9e.shard` together with its hash code.
(head of path/to/shard/97/9e.shard)
979e25dd9941e53784ce56e98842f95b8f7fd026 Say hello
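The mapping from a text to its shard file can be sketched as follows. This is a minimal illustration, not the package's actual code: the helper name `shard_path` is hypothetical, and the layout for prefix lengths other than 4 (two hex digits per path component) is an assumption inferred from the single example above.

```python
from hashlib import sha1
from pathlib import Path


def shard_path(text: str, shard_dir: str, prefix_length: int = 4) -> Path:
    """Return the shard file a text would fall into (hypothetical helper)."""
    code = sha1(text.encode("utf-8")).hexdigest()
    prefix = code[:prefix_length]
    # Assumption: the prefix is cut into two-character pieces,
    # e.g. "979e" -> ["97", "9e"].
    parts = [prefix[i:i + 2] for i in range(0, len(prefix), 2)]
    # All but the last piece become directories; the last names the file.
    return Path(shard_dir).joinpath(*parts[:-1], parts[-1] + ".shard")


print(shard_path("Say hello", "path/to/shard").as_posix())
```

Identical texts always produce the same hash code, so they always land in the same shard. That is what allows each shard to be deduplicated on its own.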
After partitioning the texts into shards, it removes duplicate texts within each shard (partition file), and then merges the deduplicated texts.
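A rough sketch of the per-shard deduplication and merge steps is shown below. It assumes each shard line has the form `<hash code> <text>`, as in the shard excerpt above; the function names `dedup_shard` and `merge_shards` are hypothetical and the package's own implementation may differ.

```python
def dedup_shard(shard_lines):
    """Yield each distinct text of one shard once (hypothetical helper)."""
    seen = set()  # holds the hash codes of one shard only
    for line in shard_lines:
        # Split "<hash code> <text>" at the first space.
        code, _, text = line.rstrip("\n").partition(" ")
        if code in seen:
            continue
        seen.add(code)
        yield text


def merge_shards(shards):
    """Chain the deduplicated texts of all shards into one stream."""
    for shard_lines in shards:
        yield from dedup_shard(shard_lines)


shards = [["aa1 x\n", "aa1 x\n", "aa2 y\n"], ["bb1 z\n"]]
print(list(merge_shards(shards)))
```

Because the set of seen hash codes is rebuilt for every shard, memory use is bounded by the largest shard rather than by the whole corpus.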
git clone https://github.com/lovit/text-dedup.git
cd text-dedup
python setup.py install
If `--max_block_size` (`-b`) is not set, it merges all deduplicated texts into a single file, `path/to/deduplicated.text`.
Otherwise, it appends a block index to the output file name, for example `path/to/deduplicated.text.0` and `path/to/deduplicated.text.1`.
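The block naming can be illustrated with a small sketch. The helper `write_blocks` is hypothetical, it takes the limit as a plain number of bytes, and the rule for when to start a new block is an assumption; how the package parses a value such as `10Mb` is not shown here.

```python
def write_blocks(texts, output, max_block_size=None):
    """Write texts to `output`, or to `output.0`, `output.1`, ... when a
    byte limit is given. Returns the list of paths written."""
    if max_block_size is None:
        with open(output, "w", encoding="utf-8", newline="\n") as f:
            for text in texts:
                f.write(text + "\n")
        return [output]

    paths, f, written = [], None, 0
    for text in texts:
        data = text + "\n"
        size = len(data.encode("utf-8"))
        # Assumption: start a new block when the current one would
        # otherwise grow past the limit.
        if f is None or (written > 0 and written + size > max_block_size):
            if f is not None:
                f.close()
            path = f"{output}.{len(paths)}"
            paths.append(path)
            f = open(path, "w", encoding="utf-8", newline="\n")
            written = 0
        f.write(data)
        written += size
    if f is not None:
        f.close()
    return paths
```

Calling `write_blocks(texts, "deduplicated.text")` produces one file, while passing a limit produces numbered files that each stay within it, except when a single text is larger than the limit.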
Very similar strings, such as `Say hello` and `Say hello.`, have different SHA-1 hash codes.
To consider only meaningful characters, you can set `--hash_func_input_format` (`-r`), which is a regular expression used by the input normalizer.
For example, `0-9가-힣ㄱ-ㅎㅏ-ㅣa-zA-Z` means that white space and the character `.` are ignored, so the SHA-1 inputs of both strings above become `Sayhello`.
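The effect of the normalizer can be reproduced with the standard `re` module. This sketch assumes the option value is used as a character set and that every character outside it is dropped before hashing; the names `normalize` and `normalized_hash` are hypothetical.

```python
import re
from hashlib import sha1

ALLOWED = "0-9가-힣ㄱ-ㅎㅏ-ㅣa-zA-Z"
# Assumption: anything outside the allowed character set is removed.
NOT_ALLOWED = re.compile(f"[^{ALLOWED}]")


def normalize(text: str) -> str:
    """Strip every character that is not in the allowed set."""
    return NOT_ALLOWED.sub("", text)


def normalized_hash(text: str) -> str:
    """Hash the normalized form of the text, not the raw text."""
    return sha1(normalize(text).encode("utf-8")).hexdigest()


print(normalize("Say hello"), normalize("Say hello."))
print(normalized_hash("Say hello") == normalized_hash("Say hello."))
```

Only the hash input is normalized in this sketch, so the text written to the output would still be the original, unnormalized string.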
It also supports multiprocessing.
The default number of processes is `cpu_count - 1`, or you can set it manually with the `-p` (`--n_processes`) argument.
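Since shards are independent of one another, they are a natural unit of parallel work. The sketch below shows one way to spread shards over a process pool with the standard `multiprocessing` module. All function names are hypothetical, `count_unique` is only a stand-in for the real per-shard work, and the lower bound of one worker is an added safeguard rather than documented behavior.

```python
from multiprocessing import Pool, cpu_count


def default_n_processes() -> int:
    # Leave one core free, but never go below one worker (assumed safeguard).
    return max(1, cpu_count() - 1)


def count_unique(shard_lines):
    """Stand-in for per-shard work: count distinct lines in one shard."""
    return len(set(shard_lines))


def process_shards(shards, n_processes=None):
    """Apply the per-shard function to every shard in a process pool."""
    n = n_processes or default_n_processes()
    with Pool(n) as pool:
        return pool.map(count_unique, shards)
```

When calling `process_shards` from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork.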
After merging, the `--shard` directory is removed.
If you want to keep the directory, use the `--keep` argument.
TIPS: Python file I/O is slow, so running this package on many small text files is slower than running it on a few large ones. Concatenate small files into larger files first, and then run the package on the merged files.
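One possible way to do the concatenation is sketched below. The helper `concatenate` is hypothetical and not part of this package. It adds a newline after any file that does not end with one, so that the last line of one file is not fused with the first line of the next.

```python
import os
import shutil


def concatenate(paths, merged_path):
    """Append the given files to `merged_path`, one after another."""
    with open(merged_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                # Stream the file in chunks instead of loading it whole.
                shutil.copyfileobj(src, out)
                if src.tell() > 0:
                    # Look at the final byte to see if a newline is missing.
                    src.seek(-1, os.SEEK_END)
                    if src.read(1) != b"\n":
                        out.write(b"\n")
```

The merged file can then be passed to `--inputs` in place of the original small files.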
text-dedup \
--inputs path/to/textfile [path/to/or/directory] [path/to/wildcard*] \
--shard path/to/shard-directory \
--output path/to/deduplicated.text \
--max_block_size 10Mb \
--hash_func_input_format 0-9가-힣ㄱ-ㅎㅏ-ㅣa-zA-Z \
--prefix_length 4
Elapsed times on an i7-5820
num lines / num tokens | file size | prefix length | elapsed time | max shard size |
---|---|---|---|---|
1,607,769 / 18,923,834 | 209MB (a single text file) | 2 | 01:34 | 1.2MB |
1,607,769 / 18,923,834 | 209MB (a single text file) | 4 | 01:50 | 28KB |
33,884,047 / 607,593,167 | 6.0GB (166 text files) | 2 | 27:18 | 107MB |
33,884,047 / 607,593,167 | 6.0GB (166 text files) | 4 | 49:02 | 208KB |