Issues
How to dedup substrings in one dataset?
#7 opened by lan2016286 - 1
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
#51 opened by bingkunyao - 5
Distributed running
#49 opened by jordane95 - 0
Is there something wrong with finish_dedup_wiki40b.py?
#48 opened by mathCrazyy - 1
Can this tool process Chinese?
#47 opened by mathCrazyy - 1
Accessing the duplicates and their counts
#20 opened by yanaiela - 1
Question: Upper Bound
#44 opened by bezir - 2
Count_occurrence does not work with tokenizer?
#43 opened by WWWonderer - 2
question about wstring_equal function
#41 opened by WWWonderer - 1
Could a pure-Python version be provided? I believe many researchers do not have permission to install gcc on their servers.
#42 opened by gongye19 - 4
When I use the tokenizer, I obtain many patterns that span across the data, which is quite strange.
#39 opened by gawei1995 - 1
customized dataset deduplication
#38 opened by zengyangjie - 1
Where is the data?
#40 opened by jianshu93 - 4
How to deduplicate Hugging Face datasets
#21 opened by StephennFernandes - 2
Incomplete Sentences
#34 opened by MiladMolazadeh - 1
remove_ex in finish_dedup_wiki40b
#35 opened by wead-hsu - 1
How to restore the result data after deduplication (remove invisible characters)
#29 opened by greenriver777 - 3
Error when running the code
#12 opened by MatthewCYM - 2
Retain one instance per duplicate
#32 opened by RobinQrtz - 2
RAM crash when using the collect method
#18 opened by acul3 - 1
Implementation of NearDup (approximate match)
#27 opened by Yaoming95 - 1
Simple test
#26 opened by KeremTurgutlu - 0
Off-by-1 error in `collect`?
#24 opened by ola13 - 2
question about deduplication cluster size
#23 opened by everks - 2
one bug when I use
#17 opened by flyingwaters - 1
Should newline char be removed
#16 opened by cperiz - 2
Unexpected behavior with ending symbols
#15 opened by mitya52 - 2
"failed to fill whole buffer" errors
#14 opened by mitya52 - 7
Can the tool run on plain text files?
#8 opened by m-resta - 1
false positives
#9 opened by ChenghaoMou - 7
How to dedup between two datasets?
#3 opened by mralexis1 - 10
Error on self deduplication
#5 opened by zijwang - 3
Why not use Simhash?
#4 opened by Ethan-yt