ChenghaoMou/text-dedup

Suffix array collect src/main.rs:174 assertion failed: input.len() % size_width == 0

leoMesss opened this issue · 2 comments

Cleaning up                                                                                                                                                                                                            
    Finished dev [optimized + debuginfo] target(s) in 3.32s                                                                                                                                                            
     Running `target/debug/dedup_dataset self-similar --data-file output/temp_text.txt --length-threshold 100 --cache-dir ./tmp/cache --num-threads 128`                                                               
Start load!                                                                                                                                                                                                            
0 / 60318160                                                                                                                                                                                                           
3093441 / 60318161                                                                                                                                                                                                     
9500721 / 60318161                                                                                                                                                                                                     
19001441 / 60318161                                                                                                                                                                                                    
28502161 / 60318161                                                                                                                                                                                                    
34909441 / 60318161                                                                                                                                                                                                    
44410161 / 60318161                                                                                                                                                                                                    
53910881 / 60318161                                                                                                                                                                                                    
Duplicates found: 191378774                                                                                                                                                                                            
Total time taken: 70972ms                                                                                                                                                                                              
    Finished dev [optimized + debuginfo] target(s) in 0.89s                                                                                                                                                            
     Running `target/debug/dedup_dataset collect --data-file output/temp_text.txt --length-threshold 100 --cache-dir ./tmp/cache`                                                                                      
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5                                                                                                                    
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace                                                                                                                                          
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5                                                                                                                    
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5                                                                                                                    
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5                                                                                                                    
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5                                                                                                                    
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5                                                                                                                    
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5                                                                                                                    
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5                                                                                                                    
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.3.2/src/scoped.rs:34:43
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5

The dataset is C4. The code worked well on a small sample, but when I scaled up to 7.5 GB, this error occurred. I don't know why this happened.
Another question: I have also tried a Chinese-language dataset called wudao200G, and it always gives the warning "There is a match longer than 50,000,000 bytes." I don't know whether this is a problem with the dataset or with the code.
Thanks for your help.

Well, I solved the first problem (I hope so) by clearing all the cached files in ./output and ./tmp. But I still don't know whether this code can properly handle a Chinese-language dataset.
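
For anyone hitting the same panic, here is a rough sketch of what an assertion like `input.len() % size_width == 0` is guarding. This is an illustration only, under the assumption (not confirmed in this thread) that the cache files hold fixed-width integers and that leftover files from an earlier run on a different-sized input can have a length that is not a multiple of the current width. The function name `decode_fixed_width` is hypothetical and is not part of the tool.

```python
def decode_fixed_width(buf: bytes, width: int) -> list[int]:
    """Decode a buffer of little-endian unsigned integers, each `width` bytes wide.

    Hypothetical helper: it mirrors the idea behind the failing assertion,
    not the tool's actual code.
    """
    # Same invariant as the assertion in the log: the buffer must split
    # evenly into width-sized records, otherwise it cannot be decoded.
    if len(buf) % width != 0:
        raise ValueError(
            f"buffer length {len(buf)} is not a multiple of width {width}"
        )
    return [
        int.from_bytes(buf[i:i + width], "little")
        for i in range(0, len(buf), width)
    ]


# A buffer written with 5-byte records decodes cleanly with width 5 ...
records = (1).to_bytes(5, "little") + (2).to_bytes(5, "little")
decoded = decode_fixed_width(records, 5)

# ... but the same bytes read with a different width (4) fail the check.
try:
    decode_fixed_width(records, 4)
    stale_ok = True
except ValueError:
    stale_ok = False
```

If that assumption holds, it would be consistent with the fix above: removing the old files forces them to be regenerated with one consistent width.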

The code is language-agnostic: it treats all input as raw bytes, so it works with Chinese text.
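
A small sketch of what "everything in bytes" means for Chinese text, assuming the input is UTF-8 encoded. The helper `find_all` is a hypothetical name used only for this illustration; it is a naive byte search, not the suffix-array method the tool uses. The point is that matching happens on encoded bytes, so offsets and lengths are counted in bytes rather than characters.

```python
def find_all(corpus: bytes, fragment: bytes) -> list[int]:
    """Return every byte offset at which `fragment` occurs in `corpus`."""
    offsets = []
    start = corpus.find(fragment)
    while start != -1:
        offsets.append(start)
        start = corpus.find(fragment, start + 1)
    return offsets


doc = "你好世界" + "abc" + "你好世界"
corpus = doc.encode("utf-8")
fragment = "你好世界".encode("utf-8")

n_chars = len(doc)          # counted in characters
n_bytes = len(corpus)       # counted in UTF-8 bytes
offsets = find_all(corpus, fragment)

print(n_chars, n_bytes, len(fragment), offsets)
```

Each of these Chinese characters takes 3 bytes in UTF-8, so a byte-based setting such as `--length-threshold 100` covers roughly a third as many Chinese characters as ASCII characters.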