janhq/ichigo

chore: Augmenting current Vietnamese speech dataset


Problem

Most of the current Vietnamese datasets share the same pipeline (YouTube -> VAD -> Whisper -> Normalization). The normalization step strips all punctuation from the transcription, which loses information.
e.g. Bud500

Screenshot 2024-12-12 at 09 32 31

or GigaSpeech 2
Screenshot 2024-12-12 at 09 37 30
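For illustration, the punctuation loss above typically comes from a normalization step along these lines (a minimal sketch, not the actual Bud500 / GigaSpeech 2 code):

```python
import re

def normalize(text: str) -> str:
    """Typical ASR-pipeline normalization: lowercase and strip punctuation.

    This is the step that discards information -- commas, question marks,
    and sentence boundaries are all gone after it runs.
    """
    text = text.lower()
    # \w matches Unicode word characters in Python 3, so Vietnamese
    # diacritics survive while punctuation is removed
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())
```

e.g. `normalize("Xin chào, bạn khỏe không?")` gives `"xin chào bạn khỏe không"` — exactly the flattened style seen in the screenshots.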

Goal

Improve the current datasets by restoring natural punctuation

Draft solution

  • Use Whisper large v3 to transcribe the audio in those datasets -> whisper_transcription_format

  • Use Llama 3.2 8B, given the ground_truth and the whisper_transcription_format, to reformat the ground_truth

  • Note: we only use the structure (punctuation/casing) of the Whisper large output, not its labels

  • Architecture:

Screenshot 2024-12-12 at 09 48 12
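The reformat step above could be sketched as a prompt like this; the prompt wording and the model call around it are assumptions, not the final pipeline:

```python
def build_reformat_prompt(ground_truth: str, whisper_transcription: str) -> str:
    """Ask the LLM to copy Whisper's punctuation/casing structure onto the
    trusted ground-truth words. Only the *structure* of the Whisper output
    is used; its words are explicitly not to be trusted."""
    return (
        "You are reformatting a Vietnamese transcript.\n"
        "GROUND_TRUTH (correct words, no punctuation):\n"
        f"{ground_truth}\n\n"
        "WHISPER_OUTPUT (may contain wrong words, but has punctuation and casing):\n"
        f"{whisper_transcription}\n\n"
        "Return GROUND_TRUTH with punctuation and casing borrowed from "
        "WHISPER_OUTPUT. Do not add, remove, or change any words."
    )
```

The explicit "do not change any words" constraint matters because of the over-correction behavior noted below.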

Tasklist

  • Transcribe Bud500 and GigaSpeech refined vi using Whisper large v3
  • Set up the distil-label pipeline
  • QA dataset

Need to discuss the pipeline further because:

  • The result from Whisper large v3 is bad for Bud500 audio.
  • Llama tends to complete or modify the label instead of only correcting punctuation/typos.
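One way to catch the second failure mode in the QA step: reject any reformatted label whose word sequence diverges from the ground truth, since the LLM is only supposed to restore punctuation. A minimal sketch (the 0.95 threshold is a guess and should be tuned on a QA sample):

```python
import difflib
import re

def words(text: str) -> list[str]:
    # Compare punctuation- and case-insensitive word sequences
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def accept_label(ground_truth: str, reformatted: str, min_ratio: float = 0.95) -> bool:
    """Reject outputs where the LLM completed or rewrote the sentence
    instead of only restoring punctuation/casing."""
    matcher = difflib.SequenceMatcher(None, words(ground_truth), words(reformatted))
    return matcher.ratio() >= min_ratio
```

A pure punctuation/casing restoration scores 1.0 and passes; a completion that adds extra words drops the ratio and gets filtered out.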