tazz4843/whisper-rs

WhisperContext::tokenize() incorrectly passes raw &str pointer to whisper_tokenize(), which expects a C-style string

jcsoo opened this issue · 2 comments

jcsoo commented

WhisperContext::tokenize(...) currently calls whisper_rs_sys::whisper_tokenize with a raw pointer to a Rust &str:

let ret = unsafe {
    whisper_rs_sys::whisper_tokenize(
        self.ctx,
        text.as_ptr() as *const _,
        tokens.as_mut_ptr(),
        max_tokens as c_int,
    )
};

However, whisper_tokenize() expects a null-terminated C string and reads until it encounters a zero byte, which may lie well beyond the end of the &str. A Rust &str carries its length separately and has no terminator, so when a string literal is passed in, whisper_tokenize() will likely tokenize adjacent bytes from the program binary.

Unfortunately, a Rust &str cannot be converted directly to a &CStr, because that requires appending a terminating zero byte. To keep the API as-is, tokenize() would need to allocate a CString to pass to whisper_tokenize(), which isn't ideal but will probably have minimal performance impact.
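For illustration, here is a minimal sketch of the allocate-internally approach. It is not the actual whisper-rs code: read_until_nul is a hypothetical stand-in for a C function like whisper_tokenize() that scans for a zero byte, and tokenize_len is a hypothetical wrapper name. Only the standard library's CString and CStr are assumed.

```rust
use std::ffi::{CStr, CString, NulError};
use std::os::raw::c_char;

/// Hypothetical stand-in for a C function such as `whisper_tokenize`:
/// it reads from `text` until it reaches a zero byte and returns how many
/// bytes it read before the terminator.
///
/// # Safety
/// `text` must point to a valid, NUL-terminated string.
unsafe fn read_until_nul(text: *const c_char) -> usize {
    unsafe { CStr::from_ptr(text) }.to_bytes().len()
}

/// Hypothetical wrapper that keeps a `&str` signature by allocating a
/// `CString` internally. Fails if `text` contains an interior zero byte.
fn tokenize_len(text: &str) -> Result<usize, NulError> {
    // Copies the bytes of `text` and appends the terminating zero byte.
    let c_text = CString::new(text)?;
    // `c_text` lives until the end of this function, so the pointer stays
    // valid for the duration of the call.
    Ok(unsafe { read_until_nul(c_text.as_ptr()) })
}
```

One consequence of this design is a new failure mode: CString::new rejects input with an embedded zero byte, so the wrapper has to either return an error or decide how to handle that input.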

Alternatively, WhisperContext::tokenize() could be changed to take a &CStr instead of a &str.
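A rough sketch of what the &CStr variant would mean for callers, assuming a hypothetical tokenize_len_cstr function in place of the real method (the body only measures the string, where a real wrapper would hand text.as_ptr() to the C function):

```rust
use std::ffi::{CStr, CString};

/// Hypothetical `&CStr`-based variant of a tokenize-style wrapper.
/// Nothing is allocated here; NUL termination is guaranteed by the type.
fn tokenize_len_cstr(text: &CStr) -> usize {
    // A real wrapper would pass `text.as_ptr()` to the C function here.
    text.to_bytes().len()
}

/// What a caller holding a `&str` would have to do before each call.
fn caller_side(text: &str) -> usize {
    // The allocation moves to the caller, along with the interior-NUL check.
    let owned = CString::new(text).expect("text must not contain a zero byte");
    // `&CString` dereferences to `&CStr`.
    tokenize_len_cstr(&owned)
}
```

Callers with a byte literal that already ends in a zero byte can skip the allocation, for example with CStr::from_bytes_with_nul(b"hello\0"), but most Rust strings would still need the conversion shown in caller_side.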

Allocating internally would likely have the least API churn, as making users create CStrs would probably be very annoying over time. As with the other issue, a PR is welcomed. If not, I'll clean this up at some point in the future.

Closed by #47