Whisper failed to recognize Chinese content
Closed this issue · 1 comment
AspadaX commented
Here is the code that I used to extract subtitles from a video I recorded. It uses ffmpeg to extract the audio track first (a sketch of that helper is below), then sends the audio to whisper-rs for inference. However, the extracted subtitles are irrelevant to the video. I tried tweaking BeamSearch, best_of, and the initial prompt, but with no success.
Am I doing something wrong in my code?
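The audio_conversion module is not pasted in full; it is only a thin wrapper around the ffmpeg CLI. A minimal sketch of what convert_video_to_audio does (the exact flags and the error type are assumptions; the 16 kHz mono PCM output is what whisper.cpp expects):

// audio_conversion.rs: a minimal sketch, assuming the helper shells out to the ffmpeg CLI.
// The struct and function names match the ones used in main(); everything else is illustrative.
use std::process::Command;

pub struct AudioVideoConverter;

impl AudioVideoConverter {
    pub fn convert_video_to_audio(input: &str, output: &str) -> std::io::Result<()> {
        let status = Command::new("ffmpeg")
            .args([
                "-y",                // overwrite the output file if it exists
                "-i", input,         // input video
                "-vn",               // drop the video stream
                "-ar", "16000",      // resample to 16 kHz, which whisper.cpp expects
                "-ac", "1",          // downmix to mono
                "-c:a", "pcm_s16le", // 16-bit signed PCM wav
                output,
            ])
            .status()?;
        if !status.success() {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                "ffmpeg exited with a non-zero status",
            ));
        }
        Ok(())
    }
}

And the main program: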
use std::fs::File;
use std::io::BufReader;
use std::path::Path;

use rodio::{Decoder, Source};
use whisper_rs::{FullParams, SamplingStrategy, WhisperContext, WhisperContextParameters};

mod audio_conversion;

fn main() {
    // Extract an audio track from the video with ffmpeg (see the audio_conversion sketch above).
    let video_path = Path::new("./video.mov");
    if !video_path.exists() {
        println!("video file not found.");
    } else {
        println!("{}", video_path.file_name().unwrap().to_str().unwrap());
    }
    match audio_conversion::AudioVideoConverter::convert_video_to_audio(
        video_path.to_str().unwrap(),
        "./extracted_audio.wav",
    ) {
        Ok(_) => println!("audio extraction finished."),
        Err(error) => println!("{:?}", error),
    };

    // Set up the decoding parameters. Greedy sampling here; I also tried BeamSearch
    // and an initial prompt (see the variant below) with the same result.
    let mut whisper_parameters = FullParams::new(SamplingStrategy::Greedy { best_of: 0 });
    whisper_parameters.set_language(Some("zh"));
    // Initial prompt ("The following is a Mandarin sentence:"); tried as a decoding hint, no effect.
    // whisper_parameters.set_initial_prompt("以下是普通话句子:");
    whisper_parameters.set_print_realtime(true);

    // Load the model.
    let whisper_context = match WhisperContext::new_with_params(
        "ggml-large-v2-q5_0.bin",
        WhisperContextParameters::default(),
    ) {
        Ok(result) => result,
        Err(error) => panic!("{}", error),
    };

    // Decode the extracted wav with rodio. Note: rodio does not resample, so the wav
    // must already be 16 kHz mono, which is the format whisper.cpp expects.
    let audio_track = match File::open("./extracted_audio.wav") {
        Ok(result) => BufReader::new(result),
        Err(error) => panic!("{}", error),
    };
    let decoded_audio_track: Vec<i16> = match Decoder::new(audio_track) {
        Ok(result) => result.convert_samples::<i16>().collect(),
        Err(error) => panic!("{}", error),
    };

    // Convert the i16 samples to the f32 samples whisper-rs works with.
    let mut samples: Vec<f32> = vec![0.0f32; decoded_audio_track.len()];
    whisper_rs::convert_integer_to_float_audio(&decoded_audio_track, &mut samples)
        .expect("sample conversion failed.");

    // Now we can run the model.
    let mut state = whisper_context.create_state().expect("failed to create state");
    state
        .full(whisper_parameters, &samples[..])
        .expect("failed to run model");

    // Fetch the results segment by segment.
    let num_segments = state
        .full_n_segments()
        .expect("failed to get number of segments");
    for i in 0..num_segments {
        let segment = state
            .full_get_segment_text(i)
            .expect("failed to get segment");
        let start_timestamp = state
            .full_get_segment_t0(i)
            .expect("failed to get segment start timestamp");
        let end_timestamp = state
            .full_get_segment_t1(i)
            .expect("failed to get segment end timestamp");
        println!("[{} - {}]: {}", start_timestamp, end_timestamp, segment);
    }
}
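For reference, the BeamSearch and initial-prompt tweaks mentioned above looked roughly like this (the beam_size and patience values are only illustrative); none of them changed the output:

// Illustrative variant of the parameter setup above.
let mut whisper_parameters = FullParams::new(SamplingStrategy::BeamSearch {
    beam_size: 5,   // example value
    patience: -1.0, // whisper.cpp default (disabled)
});
whisper_parameters.set_language(Some("zh"));
// "The following is a Mandarin sentence:", used as a decoding hint
whisper_parameters.set_initial_prompt("以下是普通话句子:");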
AspadaX commented
Closing the issue. As I figured out, the problem is with the model, since English inference works fine. After I switched to a fine-tuned model, it works as expected.
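For anyone who hits the same problem: the only code change was the model file passed to WhisperContext. The filename below is just a placeholder for whichever fine-tuned Chinese model you use:

// Drop-in replacement for the model-loading block in main() above.
// "ggml-finetuned-zh.bin" is a placeholder, not a real model name.
let whisper_context = WhisperContext::new_with_params(
    "ggml-finetuned-zh.bin",
    WhisperContextParameters::default(),
)
.expect("failed to load model");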