tazz4843/whisper-rs

Invalid utf-8

Closed this issue · 11 comments

The song in Hebrew:
muminim hebrew.zip

This is an upstream issue, not something we can control. I run into this myself with my own services, and I just log it and ignore the output.

Doing some digging I found the following:
ggerganov/whisper.cpp#1098
ggerganov/whisper.cpp#1118

@tazz4843
How can I ignore the errors and keep only the part that was transcribed successfully? Or does this mean some languages won't work at all?
Right now I can't transcribe some languages at all.

I checked whisper.cpp using its CLI example.
The issue shows up there too, but only in the terminal output.
If I write the whisper.cpp output to a file, it works fine,
so I think it's still an encoding issue in whisper-rs.
It happens here:
whisper_state.rs#L481

We don't do anything with the string, so for the failure to be on our side there would have to be a bug in Rust's std string library, and there's essentially no chance of that. That means whisper.cpp must be returning an invalid UTF-8 string. We could return the raw bytes on error, but those are somewhat useless if you can't parse them, unless you want to parse only up to the index where validation fails (which would be a valid use case, and if you want this added I can do so).
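For reference, "parse only up to the index where it fails" can be sketched with the standard library alone. This is a minimal sketch, not whisper-rs API: `valid_prefix` is a hypothetical helper name, and it assumes you already have the raw segment bytes in hand.

```rust
use std::str;

/// Hypothetical helper (not part of whisper-rs): returns the longest
/// prefix of `bytes` that is valid UTF-8.
fn valid_prefix(bytes: &[u8]) -> &str {
    match str::from_utf8(bytes) {
        // The whole buffer is valid, so return it unchanged.
        Ok(s) => s,
        // `valid_up_to()` is the index of the first byte that failed
        // validation; everything before it is guaranteed to be valid.
        Err(e) => str::from_utf8(&bytes[..e.valid_up_to()])
            .expect("prefix before valid_up_to() is valid UTF-8"),
    }
}
```

With this approach everything after the first bad byte is dropped, so a single stray byte early in a segment loses the rest of that segment.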

UTF-8 is designed specifically to be able to recover from invalid strings, right?
You could discard whatever is invalid (seems best to me); or, as this crate does (I think; it is dense and I didn't care to verify after glancing at the code), decode the invalid codepoints as if their prefixes had been right.

0xxxxxxx -> great, we're back to ASCII, continue
10xxxxxx -> crap, invalid
110xxxxx -> great, back to valid input
10xxxxxx -> end of the last char
10xxxxxx -> invalid
11110xxx -> start of 4 byte char
11xxxxxx -> invalid
11110xxx -> start of 4 byte char
10xxxxxx
10xxxxxx
10xxxxxx -> end of valid 4 byte char

You could still parse 0xxxxxxx 110xxxxx 10xxxxxx 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx out of that, and assuming what you had was one invalid codepoint and a ton of crap, it will probably be fine.
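The "discard whatever is invalid" approach walked through above could look roughly like this in Rust. It is only a sketch: `decode_skipping_invalid` is a made-up name, and it relies on `Utf8Error::valid_up_to` and `Utf8Error::error_len` from the standard library to find and step over each bad sequence.

```rust
use std::str;

/// Hypothetical helper: decodes `bytes` as UTF-8, silently dropping
/// every invalid or truncated sequence instead of failing.
fn decode_skipping_invalid(mut bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len());
    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // The remainder is entirely valid, so we are done.
                out.push_str(s);
                return out;
            }
            Err(e) => {
                // Keep the valid part that precedes the error.
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                out.push_str(str::from_utf8(valid).expect("validated prefix"));
                match e.error_len() {
                    // Skip the invalid sequence and resynchronize after it.
                    Some(n) => bytes = &rest[n..],
                    // `None` means the input ended in the middle of a
                    // character; there is nothing more to recover.
                    None => return out,
                }
            }
        }
    }
}
```

For example, the bytes `61 80 D7 A9 80 F0 FF F0 9F 98 80` follow the same shape as the walk-through (ASCII, stray continuation byte, 2-byte char, stray continuation byte, broken 4-byte start, invalid byte, valid 4-byte char) and decode to three characters with the garbage removed.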

There is String::from_utf8_lossy for that, which does throw away information to get a valid UTF-8 string (each invalid sequence is replaced with U+FFFD).
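A minimal sketch of that route, assuming the raw bytes of a segment are available (`decode_lossy` is just an illustrative wrapper name, not something from whisper-rs):

```rust
/// Illustrative wrapper: never fails, but each invalid sequence in
/// `bytes` becomes U+FFFD (the replacement character) in the result.
fn decode_lossy(bytes: &[u8]) -> String {
    // `from_utf8_lossy` returns a `Cow<str>`; it only allocates when
    // something actually had to be replaced.
    String::from_utf8_lossy(bytes).into_owned()
}
```

Unlike discarding, this keeps a visible marker where the bad bytes were, which can be useful for logging how often the upstream output is broken.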

I still experience this issue, and I'm not sure whether it's in my control or whether whisper-rs needs to be changed:
thewh1teagle/vibe#34
Can I ignore these UTF-8 errors?

Remind me in a few days and I can add a function to infallibly convert.

Hey, just a reminder.
Many people have opened issues related to this in vibe/issues, so I hope to get it solved.
I think it's better to receive some invalid characters than to fail the whole transcription.

Should be solved in f4ea0d9

Thanks. I wasn't able to use it, but it helped me understand where the problem is, so I added
github.com/thewh1teagle/whisper-rs/ee93930 and it looks like that fixed the issue (and I don't even see invalid characters).
I can create a PR from that if you want :)