WhisperContext::tokenize() incorrectly passes raw &str pointer to whisper_tokenize(), which expects a C-style string
jcsoo opened this issue · 2 comments
WhisperContext::tokenize(...) currently calls whisper_rs_sys::whisper_tokenize with a raw pointer to a Rust &str:
    let ret = unsafe {
        whisper_rs_sys::whisper_tokenize(
            self.ctx,
            text.as_ptr() as *const _,
            tokens.as_mut_ptr(),
            max_tokens as c_int,
        )
    };
However, whisper_tokenize() expects a null-terminated C string and will read until it encounters a zero byte. A Rust &str stores its length separately and is not null-terminated, so that zero byte could be well beyond the end of the string. When passed a string literal, whisper_tokenize() will likely tokenize nearby bytes from the program binary.
Unfortunately, it is not possible to convert a Rust &str directly to a &CStr, because doing so requires appending a terminating zero byte. This means that to keep the API as-is, tokenize() needs to allocate a CString that can be passed to whisper_tokenize, which doesn't seem ideal but will probably have minimal performance impact.
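A minimal sketch of the allocate-internally approach, using only the standard library. The function read_until_nul is a hypothetical stand-in for whisper_tokenize (it just counts bytes up to the first zero byte), and tokenize_len is an illustrative name, not part of whisper-rs. CString::new also rejects input containing an interior zero byte, so the wrapper has to surface that error somehow.

```rust
use std::ffi::{CStr, CString, NulError};
use std::os::raw::c_char;

/// Hypothetical stand-in for whisper_tokenize(): like any C function that
/// takes a `const char *`, it reads bytes until it reaches a zero byte.
unsafe fn read_until_nul(text: *const c_char) -> usize {
    // Caller must guarantee that `text` points to a NUL-terminated buffer.
    unsafe { CStr::from_ptr(text) }.to_bytes().len()
}

/// Keeps the `&str` signature and allocates a NUL-terminated copy internally.
fn tokenize_len(text: &str) -> Result<usize, NulError> {
    // Copies `text` and appends the terminating zero byte.
    // Returns an error if `text` already contains a zero byte.
    let c_text = CString::new(text)?;
    // `c_text` is alive for the whole call, so the pointer stays valid.
    Ok(unsafe { read_until_nul(c_text.as_ptr()) })
}
```

Callers keep passing ordinary string slices, for example tokenize_len("hello"), and only have to handle the extra error case for input with an embedded zero byte.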
Alternatively, WhisperContext::tokenize() could be changed to take a &CStr instead of a &str.
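For comparison, a rough sketch of what the &CStr variant might look like. As before, count_until_nul is a hypothetical stand-in for whisper_tokenize and tokenize_len_cstr is an illustrative name. No allocation happens inside the wrapper, but the caller is now responsible for producing a null-terminated value.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical stand-in for whisper_tokenize(): reads until a zero byte.
unsafe fn count_until_nul(text: *const c_char) -> usize {
    // Caller must guarantee that `text` points to a NUL-terminated buffer.
    unsafe { CStr::from_ptr(text) }.to_bytes().len()
}

/// Takes `&CStr`, so the type system guarantees NUL termination
/// and no copy is needed inside the wrapper.
fn tokenize_len_cstr(text: &CStr) -> usize {
    unsafe { count_until_nul(text.as_ptr()) }
}
```

A caller would then write something like tokenize_len_cstr(CStr::from_bytes_with_nul(b"hello\0").unwrap()), or build a CString first when the text is only known at runtime.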
Allocating internally would likely cause the least API churn, as making users create CStrs would probably become very annoying over time. As with the other issue, a PR is welcome. If not, I'll clean this up at some point in the future.