Tokenize non-breaking space
sbeckeriv opened this issue · 2 comments
Dearest Maintainer,
I have been using rust-bert, and a panic led me here. My input contains a non-breaking space, \u{a0} (https://en.wikipedia.org/wiki/Non-breaking_space). The string appears to be valid and the characters are all there: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=865b0df5864e4d3b68004df0babd3afe
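For reference (the playground link above shows the same thing), the character is a single char but two bytes in UTF-8, which is how a byte index can end up inside it:

fn main() {
    let nbsp = "\u{a0}";
    assert_eq!(nbsp.chars().count(), 1); // one Unicode scalar value
    assert_eq!(nbsp.len(), 2);           // but two bytes in UTF-8
}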
My error:
thread 'main' panicked at 'byte index 11 is not a char boundary; it is inside '\u{a0}' (bytes 10..12) of input. We
', /home/becker/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/str/mod.rs:1920:47
It comes from:
14: core::str::traits::<impl core::slice::SliceIndex for core::ops::range::Range>::index::{{closure}}
15: rust_tokenizers::preprocessing::tokenizer::tokenization_utils::split_on_regex_with_lookahead
I have looked over the code and I don't understand which index is off. It looks like it calls len_utf8 at the right spot.
Example code to trigger the issue:
use std::panic;
use rust_bert::pipelines::summarization::SummarizationModel;

let result = panic::catch_unwind(|| {
    let summarization_model =
        SummarizationModel::new(Default::default()).expect("summarization_model fail");
    // The non-breaking spaces (\u{a0}) in this input trigger the panic.
    let input = [" input.\u{a0} \u{a0}We"];
    summarization_model.summarize(&input).join(" ")
});
In the meantime, I am going to preprocess the input and convert \u{a0} to a regular space.
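A minimal sketch of that workaround (the helper name is just mine, for illustration):

// Replace non-breaking spaces with plain ASCII spaces before summarizing.
fn normalize_nbsp(text: &str) -> String {
    text.replace('\u{a0}', " ")
}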
Any help in understanding this would be much appreciated.
Thanks again
Becker
Thank you for catching this. This was caused by the (wrong) assumption that all space characters have a UTF-8 length of one byte. \u{a0}, for example, is two bytes long and breaks the code as you illustrated. Pushing a fix that should solve the issue in #31
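For illustration only (this is not the actual code from #31): byte offsets into a &str have to land on char boundaries, so split positions need to advance by each character's len_utf8 rather than by 1.

fn main() {
    let s = " input.\u{a0} \u{a0}We";

    // Wrong: assuming every whitespace character is one byte wide means an
    // index like `idx + 1` can land inside \u{a0} and panic with
    // "byte index ... is not a char boundary".

    // Safe: advance by the character's actual UTF-8 width.
    let mut start = 0;
    for (idx, c) in s.char_indices() {
        if c.is_whitespace() {
            println!("{:?}", &s[start..idx]); // idx is always a char boundary
            start = idx + c.len_utf8();       // skip the full character
        }
    }
    println!("{:?}", &s[start..]);
}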
Hello @guillaume-be
Thanks again! I am happy that I could finally help. I thought the - 1 was there because end was off by one :)
Thanks again. Looking forward to updating.
Becker