When transcribing Chinese audio, using whisper_full_get_segment_text can return the correct text, but using whisper_full_get_token_text might result in NULL.

Question

ppcfan opened this issue a month ago · 0 comments

I ran into a problem while transcribing Chinese audio. After running whisper_full(...) on a segment of Chinese audio, whisper_full_get_segment_text returns the correct Chinese text. However, when I iterate over the tokens of that segment and call whisper_full_get_token_text on each one, some tokens return NULL. I suspect this happens because a single Chinese character can be split across multiple tokens. If so, how does whisper_full_get_segment_text combine multiple tokens into a single Chinese character? Is there a way to merge tokens so that I can output the correct text for each character? Thank you.
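To make the question concrete, here is a minimal sketch of the kind of merging I have in mind. It assumes that each token's text is a run of raw UTF-8 bytes, so that a three-byte Chinese character may arrive as two tokens, neither of which is valid text on its own. I have not confirmed that this is how the library behaves. The names utf8_pending, utf8_expected_len, utf8_complete_prefix and utf8_merge_token are my own hypothetical helpers, not part of whisper.cpp.

```c
#include <stddef.h>
#include <string.h>

/* Bytes held back until they form complete UTF-8 characters. */
typedef struct {
    char   bytes[256];
    size_t len;
} utf8_pending;

/* Sequence length implied by a UTF-8 lead byte.
   Returns 0 for a continuation byte or an invalid lead byte. */
static size_t utf8_expected_len(unsigned char lead) {
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Length of the longest prefix of buf[0..len) that ends on a
   character boundary. */
static size_t utf8_complete_prefix(const char *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        size_t n = utf8_expected_len((unsigned char) buf[i]);
        if (n == 0) n = 1;       /* stray byte: pass it through, do not stall */
        if (i + n > len) break;  /* last character is still incomplete */
        i += n;
    }
    return i;
}

/* Append one token's bytes to the pending buffer, write every complete
   character to out (NUL-terminated), and keep the incomplete tail for the
   next call. A NULL token is treated as empty. Returns the number of
   bytes written to out, not counting the terminator. */
static size_t utf8_merge_token(utf8_pending *p, const char *tok,
                               char *out, size_t out_cap) {
    size_t tlen = tok ? strlen(tok) : 0;
    size_t room = sizeof(p->bytes) - p->len;
    size_t done;

    if (out_cap == 0) return 0;
    if (tlen > room) tlen = room;  /* clamp: this sketch drops the overflow */
    if (tlen > 0) memcpy(p->bytes + p->len, tok, tlen);
    p->len += tlen;

    done = utf8_complete_prefix(p->bytes, p->len);
    if (done > out_cap - 1) {
        /* out is too small: emit only what fits, still on a boundary */
        done = utf8_complete_prefix(p->bytes, out_cap - 1);
    }
    memcpy(out, p->bytes, done);
    out[done] = '\0';

    /* Shift the incomplete tail to the front of the buffer. */
    memmove(p->bytes, p->bytes + done, p->len - done);
    p->len -= done;
    return done;
}
```

The idea would be to zero-initialize one utf8_pending per segment, loop i_token from 0 up to whisper_full_n_tokens(ctx, i_segment), pass each result of whisper_full_get_token_text(ctx, i_segment, i_token) to utf8_merge_token, and print out whenever the return value is non-zero. Special and timestamp tokens would presumably still need to be filtered out separately. Is this roughly what whisper_full_get_segment_text does internally, or is there a built-in way to do it?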