guillaume-be/rust-tokenizers

Tokenize non-breaking space

sbeckeriv opened this issue · 2 comments

Dearest Maintainer,

I have been using rust-bert, and a panic there leads me here. My input contains a non-breaking space, \u{a0} (https://en.wikipedia.org/wiki/Non-breaking_space). The string appears to be valid UTF-8 and the characters are all there: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=865b0df5864e4d3b68004df0babd3afe

My error:

thread 'main' panicked at 'byte index 11 is not a char boundary; it is inside '\u{a0}' (bytes 10..12) of input.   We', /home/becker/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/str/mod.rs:1920:47

It comes from:
14: core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::Range<usize>>::index::{{closure}}
15: rust_tokenizers::preprocessing::tokenizer::tokenization_utils::split_on_regex_with_lookahead

I have looked over the code and I don't understand which index is off; it looks like len_utf8 is called in the right spot.
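For reference, here is a standalone sketch of the same failure mode with no rust-bert involved: \u{a0} is a single character but two bytes in UTF-8, so any byte offset that lands between those two bytes is not a char boundary and slicing there panics.

    fn main() {
        // '\u{a0}' is one char but two bytes in UTF-8 (0xC2 0xA0).
        let s = "a\u{a0}b";
        assert_eq!(s.chars().count(), 3); // three characters...
        assert_eq!(s.len(), 4);           // ...but four bytes

        // Byte index 2 falls inside '\u{a0}' (bytes 1..3), so it is not a char boundary.
        assert!(!s.is_char_boundary(2));

        // Indexing with that offset panics with the same message as above:
        // "byte index 2 is not a char boundary; it is inside '\u{a0}' (bytes 1..3)"
        let _ = &s[..2];
    }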

Example code to trigger the issue:

    use rust_bert::pipelines::summarization::SummarizationModel;
    use std::panic;

    let result = panic::catch_unwind(|| {
        let summarization_model =
            SummarizationModel::new(Default::default()).expect("summarization_model fail");
        // The non-breaking spaces (\u{a0}) in this input trigger the panic.
        let input = [" input.\u{a0} \u{a0}We"];
        summarization_model.summarize(&input).join(" ")
    });

In the meantime, I am going to preprocess the input and convert \u{a0} to a regular space.
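A minimal sketch of that workaround (the helper name is just for illustration): replace U+00A0 with an ordinary space before handing the text to the model.

    // Workaround sketch: strip non-breaking spaces before tokenization.
    fn normalize_nbsp(text: &str) -> String {
        // Replace U+00A0 with an ordinary ASCII space.
        text.replace('\u{a0}', " ")
    }

    fn main() {
        let raw = " input.\u{a0} \u{a0}We";
        let cleaned = normalize_nbsp(raw);
        assert!(!cleaned.contains('\u{a0}'));
        // `cleaned` can now be passed to SummarizationModel::summarize
        // without hitting the char-boundary panic.
    }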

Any help in understanding this would be much appreciated.

Thanks again
Becker

Thank you for catching this. This was caused by the (wrong) assumption that all space characters are one byte long. \u{a0}, for example, is two bytes in UTF-8 and breaks the code as you illustrated. Pushing a fix that should solve the issue in #31.
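To illustrate the general idea (this is only a sketch, not the actual patch in #31): byte offsets need to advance by each character's len_utf8(), or come from char_indices(), so that multi-byte whitespace such as \u{a0} never produces an index inside a character.

    /// Illustrative sketch only: split a string on whitespace by tracking
    /// byte offsets with char_indices(), so multi-byte whitespace such as
    /// '\u{a0}' never yields an index that is not a char boundary.
    fn split_on_whitespace(text: &str) -> Vec<&str> {
        let mut pieces = Vec::new();
        let mut start = 0usize;
        for (idx, c) in text.char_indices() {
            if c.is_whitespace() {
                if start < idx {
                    pieces.push(&text[start..idx]);
                }
                // Advance past the whitespace char by its real UTF-8 width,
                // which is 2 bytes for '\u{a0}', not 1.
                start = idx + c.len_utf8();
            }
        }
        if start < text.len() {
            pieces.push(&text[start..]);
        }
        pieces
    }

    fn main() {
        let tokens = split_on_whitespace("input.\u{a0} \u{a0}We");
        assert_eq!(tokens, vec!["input.", "We"]);
    }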

Hello @guillaume-be

Thanks again! I am happy that I could finally help. I thought the - 1 was there because end was off by one :)

Thanks again. I look forward to updating.

Becker