Issues
Questions on MinHash Deduplication
#106 opened by XChen-Zero - 3
Performance issues
#105 opened by bowspider-man - 2
minhash deduplication error
#104 opened by bowspider-man - 6
text-dedup
#96 opened by HungHoangDinh - 1
Improve the readability of the documentation
#98 opened by linear - 1
Improve the usability of the scripts
#99 opened by linear - 1
Fingerprint computation
#100 opened by linear - 1
Clustering
#101 opened by linear - 1
how to dedup short text?
#103 opened by varuy322 - 0
Run MinHash dedup on Multi-Nodes
#92 opened by alielfilali01 - 2
How well does text-dedup deduplicate?
#91 opened by maoxiangyi - 2
Output Detail
#95 opened by Dodero10 - 0
When I run
#94 opened by Dodero10 - 2
AttributeError: 'DatasetDict' object has no attribute 'shard' when running SimHash deduplication
#93 opened by Dodero10 - 2
Can we use it for Arabic text?
#90 opened by hahmad2008 - 1
Failed to read data
#86 opened by programmerLY - 3
minhash_spark.py [UNABLE_TO_INFER_SCHEMA]
#85 opened by Yang-QW - 2
no module named numpy._typing
#83 opened by Leoooooo123 - 3
Consistently seeing more rows being dropped in minhash_spark.py compared to minhash.py
#71 opened by prikmm - 1
Failed to install using `pip install text-dedup`, but succeeded using `pip install -e .`
#84 opened by hancheolcho - 3
Can we accelerate the groupByKey operation by md5 hashing for the Minhash spark version?
#76 opened by 311dada - 1
boundaries of sub-strings
#73 opened by MiladMolazadeh - 4
Error when running on Windows10
#50 opened by soonjune - 3
Papers, Datasets that use this repo
#65 opened by chris-ha458 - 2
PySpark without DataProc
#64 opened by scheiblr - 2
Deduplication of union find clusters explained
#62 opened by ZJaume - 2
Python 3.9 compatibility
#59 opened by ZJaume - 3
refactor hash related code
#29 opened by chris-ha458 - 1
Open up more avenues for discussion
#46 opened by chris-ha458 - 2
Suffix array collect src/main.rs:174 assertion failed: input.len() % size_width == 0
#47 opened by leoMesss - 7
Is that MinHash OOM error normal?
#41 opened by BillZid - 4
FileNotFoundError: [Errno 2] No such file or directory: 'output/temp_text.txt.part.0-7320579'
#38 opened by listentomi - 0
User-controlled cache files clean-up
#39 opened by ChenghaoMou - 11