Suffix array collect src/main.rs:174 assertion failed: input.len() % size_width == 0
leoMesss opened this issue · 2 comments
leoMesss commented
Cleaning up
Finished dev [optimized + debuginfo] target(s) in 3.32s
Running `target/debug/dedup_dataset self-similar --data-file output/temp_text.txt --length-threshold 100 --cache-dir ./tmp/cache --num-threads 128`
Start load!
0 / 60318160
3093441 / 60318161
9500721 / 60318161
19001441 / 60318161
28502161 / 60318161
34909441 / 60318161
44410161 / 60318161
53910881 / 60318161
Duplicates found: 191378774
Total time taken: 70972ms
Finished dev [optimized + debuginfo] target(s) in 0.89s
Running `target/debug/dedup_dataset collect --data-file output/temp_text.txt --length-threshold 100 --cache-dir ./tmp/cache`
thread '<unnamed>' panicked at 'assertion failed: input.len() % size_width == 0', src/main.rs:174:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.3.2/src/scoped.rs:34:43
The dataset is C4. The code worked well on a small sample, but when I scaled up to 7.5 GB, this error occurred. I don't know why this happened.
I also have another question. I tried a Chinese-language dataset called wudao200G, and it always gives the warning "There is a match longer than 50,000,000 bytes." I don't know whether this is a problem with the dataset or with the code.
Thanks for your help.
leoMesss commented
Well, I solved the first problem (I hope so) by clearing all the cache in ./output and ./tmp. But I still don't know whether this code can properly handle a Chinese-language dataset.
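One plausible reading of why clearing the cache helped, sketched in Rust: if cached table files store fixed-width offsets and the width depends on the size of the input, a file left over from the small-sample run would no longer divide evenly by the width used for the 7.5 GB run. This is only an assumption about the cause. `bytes_per_offset` and `table_is_consistent` are hypothetical helpers, not functions from the tool; only the condition `input.len() % size_width == 0` comes from the panic message above.

```rust
/// Smallest number of bytes that can hold any offset into a text of
/// `text_len` bytes. Hypothetical helper: the real tool may pick the
/// width differently.
fn bytes_per_offset(text_len: u64) -> usize {
    let mut width = 1;
    while width < 8 && (text_len >> (8 * width)) > 0 {
        width += 1;
    }
    width
}

/// The same condition as the failing assertion, written as a check
/// that returns a bool instead of panicking.
fn table_is_consistent(table_len: usize, size_width: usize) -> bool {
    size_width > 0 && table_len % size_width == 0
}

fn main() {
    // Width that would suit a small sample versus a 7.5 GB input.
    let small_width = bytes_per_offset(50_000);
    let large_width = bytes_per_offset(7_500_000_000);

    // A made-up stale cache file holding 1001 offsets at the small width.
    let stale_table_len = 1001 * small_width;

    println!("small width = {small_width}, large width = {large_width}");
    println!(
        "stale table consistent at small width: {}",
        table_is_consistent(stale_table_len, small_width)
    );
    println!(
        "stale table consistent at large width: {}",
        table_is_consistent(stale_table_len, large_width)
    );
}
```

Under these assumptions the stale file passes the check at the old width and fails it at the new one, which would match the symptom of the error disappearing once ./output and ./tmp were emptied.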
ChenghaoMou commented
The code is language-agnostic and consumes everything as bytes, so it works with Chinese text.
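A minimal Rust sketch of what byte-level matching means for Chinese text, assuming UTF-8 input. `common_prefix_len` is a hypothetical helper written for illustration and is not part of the tool. It shows that match lengths are counted in bytes rather than characters, and that a byte-level match can end in the middle of a character.

```rust
/// Length in bytes of the longest common prefix of two byte slices.
/// Hypothetical helper for illustration only.
fn common_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

fn main() {
    // Two sentences that share their first five characters.
    let a = "数据去重很重要";
    let b = "数据去重很简单";

    // Each of these characters is 3 bytes in UTF-8, so 5 shared
    // characters give a 15-byte match.
    let shared = common_prefix_len(a.as_bytes(), b.as_bytes());
    println!("chars in a: {}, bytes in a: {}", a.chars().count(), a.len());
    println!("shared prefix in bytes: {shared}");

    // Different characters can still share leading bytes, so a
    // byte-level match may stop partway through a character.
    let partial = common_prefix_len("重".as_bytes(), "量".as_bytes());
    println!("shared bytes between two different characters: {partial}");
}
```

If thresholds are also measured in bytes, as the wording of the "50,000,000 bytes" warning suggests, then a given threshold covers roughly a third as many Chinese characters as ASCII characters.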