Issues
Failed to read data
#86 opened by programmerLY - 3
minhash_spark.py [UNABLE_TO_INFER_SCHEMA]
#85 opened by Yang-QW - 4
Run MinHash dedup on Multi-Nodes
#92 opened by alielfilali01 - 1
How well does text-dedup deduplication work?
#91 opened by maoxiangyi - 2
no module named numpy._typing
#83 opened by Leoooooo123 - 1
Can we use it for Arabic text?
#90 opened by hahmad2008 - 0
Consistently seeing more rows being dropped in minhash_spark.py compared to minhash.py
#71 opened by prikmm - 1
Failed to install using `pip install text-dedup`, but succeeded using `pip install -e .`
#84 opened by hancheolcho - 3
Can we accelerate the groupByKey operation by md5 hashing for the Minhash spark version?
#76 opened by 311dada - 10
Suffix Array consumed time
#22 opened by kimcando - 1
boundaries of sub-strings
#73 opened by MiladMolazadeh - 4
Error when running on Windows10
#50 opened by soonjune - 3
Papers, Datasets that use this repo
#65 opened by chris-ha458 - 2
PySpark without DataProc
#64 opened by scheiblr - 2
Deduplication of union find clusters explained
#62 opened by ZJaume - 2
Python 3.9 compatibility
#59 opened by ZJaume - 3
refactor hash related code
#29 opened by chris-ha458 - 1
Open up more avenues for discussion
#46 opened by chris-ha458 - 2
Suffix array collect src/main.rs:174 assertion failed: input.len() % size_width == 0
#47 opened by leoMesss - 7
Is that minhash oom error normal?
#41 opened by BillZid - 4
FileNotFoundError: [Errno 2] No such file or directory: 'output/temp_text.txt.part.0-7320579'
#38 opened by listentomi - 0
User-controlled cache files clean-up
#39 opened by ChenghaoMou - 9
How to get jaccard score for minhash spark?
#21 opened by rin2401 - 2
NameError: name 'uf' is not defined
#23 opened by duytran1332002 - 1
How to get duplicates cluster ids?
#18 opened by konradkalita - 6
the ngram setting of minhash
#17 opened by liujuncn - 1
duplicated substring removal in suffix_array.py
#19 opened by ctrajan - 1
Out of memory on Spark
#20 opened by 311dada - 2
any document or example?
#16 opened by paulcx - 3
Suffix array clean up
#14 opened by KeremTurgutlu - 2
New release
#13 opened by KeremTurgutlu - 3
Question about code of spark.py
#12 opened by ctrajan