`index out of bounds` when using custom dictionary
visig9 opened this issue · 6 comments
Test Case
use jieba_rs::Jieba;
fn main() {
use std::{fs::File, io::BufReader};
let mut dict = BufReader::new(File::open("/home/goodguy/dict.txt.big").unwrap());
let jieba = Jieba::with_dict(&mut dict).unwrap();
println!("{:?}", jieba.cut("Firefox全球市佔率為35%至40%,為全球第二流行的網頁瀏覽器[17][18][19][20][21]。Firefox在某些國家還是最流行的網頁瀏覽器,如在薩摩亞、德國、厄利垂亞及古巴,Firefox市佔率分別為61.05%、38.36%、79.39%及85.93%。據Mozilla統計,截至2014年12月,Firefox在全世界擁有10億使用者[22]。", false));
}
The dict.txt.big
can fetch from https://github.com/fxsjy/jieba/blob/master/extra_dict/dict.txt.big
Result
thread 'main' panicked at 'index out of bounds: the len is 33908 but the index is 67813', /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/slice/mod.rs:2695:10
stack backtrace:
0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:39
1: std::sys_common::backtrace::_print
at src/libstd/sys_common/backtrace.rs:71
2: std::panicking::default_hook::{{closure}}
at src/libstd/sys_common/backtrace.rs:59
at src/libstd/panicking.rs:197
3: std::panicking::default_hook
at src/libstd/panicking.rs:211
4: std::panicking::rust_panic_with_hook
at src/libstd/panicking.rs:474
5: std::panicking::continue_panic_fmt
at src/libstd/panicking.rs:381
6: rust_begin_unwind
at src/libstd/panicking.rs:308
7: core::panicking::panic_fmt
at src/libcore/panicking.rs:85
8: core::panicking::panic_bounds_check
at src/libcore/panicking.rs:61
9: <usize as core::slice::SliceIndex<[T]>>::index
at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/slice/mod.rs:2695
10: core::slice::<impl core::ops::index::Index<I> for [T]>::index
at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/slice/mod.rs:2552
11: <alloc::vec::Vec<T> as core::ops::index::Index<I>>::index
at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/liballoc/vec.rs:1687
12: jieba_rs::Jieba::calc::{{closure}}
at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:359
13: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/ops/function.rs:279
14: core::option::Option<T>::map
at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/option.rs:416
15: <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::next
at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/iter/adapters/mod.rs:570
16: core::iter::traits::iterator::select_fold1
at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/iter/traits/iterator.rs:2599
17: core::iter::traits::iterator::Iterator::max_by
at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/iter/traits/iterator.rs:2082
18: jieba_rs::Jieba::calc
at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:349
19: jieba_rs::Jieba::cut_dag_no_hmm
at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:421
20: jieba_rs::Jieba::cut_internal
at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:572
21: jieba_rs::Jieba::cut
at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:612
22: testproj::main
at src/main.rs:7
23: std::rt::lang_start::{{closure}}
at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/rt.rs:64
24: std::panicking::try::do_call
at src/libstd/rt.rs:49
at src/libstd/panicking.rs:293
25: __rust_maybe_catch_panic
at src/libpanic_unwind/lib.rs:85
26: std::rt::lang_start_internal
at src/libstd/panicking.rs:272
at src/libstd/panic.rs:394
at src/libstd/rt.rs:48
27: std::rt::lang_start
at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/rt.rs:64
28: main
29: __libc_start_main
30: _start
Clue
Look into the dict.txt.big
...
些許端 2 m
些许 163 d
些许端 2 m
些須 3 d
些须 3 d
亜 20 zg
亝 6 zg
亞 13 zg
亞 5789 j
亞丁 57 nrt
亞丁港 4 ns
亞丁灣 37 ns
亞世 6 nrt
亞世達 2 nr
亞乃夫 2 nr
亞之傑 3 nr
亞乙基 2 nr
亞伊傑 2 nr
亞伯 2 ns
亞伯奎 6 nr
亞伯拉罕 19 nrt
... when I delete all the lines after 亞 5789 j
(line number 33908
), the panic disappeared. The result look like following:
["Firefox", "全", "球", "市", "佔", "率", "為", "35", "%", "至", "40", "%", ",", "為", "全", "球", "第", "二流", "行", "的", "網", "頁", "瀏", "覽", "器", "[", "17", "]", "[", "18", "]", "[", "19", "]", "[", "20", "]", "[", "21", "]", "。", "Firefox", "在", "某", "些", "國", "家", "還", "是", "最", "流", "行", "的", "網", "頁", "瀏", "覽", "器", ",", "如", "在", "薩", "摩", "亞", "、", "德", "國", "、", "厄", "利", "垂", "亞", "及", "古", "巴", ",", "Firefox", "市", "佔", "率", "分", "別", "為", "61", ".", "05", "%", "、", "38", ".", "36", "%", "、", "79", ".", "39", "%", "及", "85", ".", "93", "%", "。", "據", "Mozilla", "統", "計", ",", "截", "至", "2014", "年", "12", "月", ",", "Firefox", "在", "全", "世界", "擁", "有", "10", "億", "使", "用", "者", "[", "22", "]", "。"]
Env
- debian buster
- rustc 1.36.0 stable
- jieba-rs 0.4.9
I think this is caused by the fact that we didn't consider that there would be duplications in the dict.
ok, I can have a look.
It is related to how we expect the load_dict
should be called, and its semantic.
If you call load_dict
and call it again, exact_match_search
shouldn't always return None
.
Since the method is public, the method could be called again and again. Then there are a few approaches we could choose.
- Merge the dictionaries, and resolve the conflict by choosing the frequencies appear in the dictionary called later.
- Merge the dictionaries, but resort to other conflict resolution.
- Throw everything from the previous dictionaries away and use everything in the last call.
For the code we have right now it is 1
, though from a brief overlook I couldn't figure out why it causes panic, and why your PR by applying self.cedar.update(word, word_id);
in the for loop has fixed the issue. Since update
should already be covered by build
in cedarwood, which is basically a for loop for update.
/// Build the double array trie from the given key value pairs
#[allow(dead_code)]
pub fn build(&mut self, key_values: &[(&str, i32)]) {
for (key, value) in key_values {
self.update(key, *value);
}
}
I guess there might be hidden bugs in cedarwood, I need to spend more time on it, but in the meantime I am good with the PR though I am not exactly sure why it fixed the panic,
MnO2/cedarwood@858e903 This commit should fix the issue.