zurawiki/tiktoken-rs

Incomplete utf-8 byte sequence from index 0

BohuTANG opened this issue · 1 comments

The code is:

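use tiktoken_rs::r50k_base;

#[test]
fn test_token() {
    // The leading emoji is a 4-byte UTF-8 character.
    let input = "🍌This is a sentence   with spaces, hahhahah haha ha";
    // `?` cannot be used in a test that returns `()`, so unwrap instead.
    let rke = r50k_base().unwrap();
    let _ = rke.split_by_token(input, true).unwrap();
}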

Error:

called `Result::unwrap()` on an `Err` value: incomplete utf-8 byte sequence from index 0

Caused by this line:

String::from_utf8(token)

Should it be String::from_utf8_lossy instead?

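Thanks for the clear repro. Unfortunately, not every string can be split into chunks that are each valid UTF-8 on their own: the tokens are byte sequences, and a multi-byte character such as 🍌 can end up spread across more than one token, so a single token may hold an incomplete sequence. I added test cases in #24 to ensure splitting and round_trip work as expected.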
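For anyone landing here, the difference between the two decoding calls can be sketched with the standard library alone. This does not call tiktoken-rs; `decode_strict`, `decode_lossy`, and `split_bytes` are hypothetical helpers, and cutting the emoji after 2 bytes is an arbitrary choice for illustration, not the boundary r50k_base actually produces.

```rust
use std::borrow::Cow;
use std::string::FromUtf8Error;

/// Strict decoding: returns an error unless `bytes` is complete, valid UTF-8.
/// This mirrors the `String::from_utf8(token)` call quoted in the issue.
fn decode_strict(bytes: &[u8]) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes.to_vec())
}

/// Lossy decoding: never fails, but replaces invalid or incomplete
/// sequences with U+FFFD, so the original bytes cannot be recovered.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Hypothetical helper: split the raw bytes of `text` at byte offset `mid`,
/// ignoring character boundaries (as a byte-level tokenizer may do).
fn split_bytes(text: &str, mid: usize) -> (&[u8], &[u8]) {
    text.as_bytes().split_at(mid)
}
```

Calling `split_bytes("🍌", 2)` yields two halves that each fail `decode_strict`, while concatenating the raw bytes and decoding once succeeds. `decode_lossy` avoids the error, but joining the lossy pieces no longer reproduces the input, which is the trade-off to weigh before switching to `from_utf8_lossy`.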