janhq/ichigo

chore: Augmenting current Vietnamese speech dataset


Problem

Most of the current Vietnamese datasets share the same pipeline (YouTube -> VAD -> Whisper -> Normalization). The normalization step strips all punctuation from the transcription, which loses information.
e.g. Bud500

Screenshot 2024-12-12 at 09 32 31

or GigaSpeech 2
Screenshot 2024-12-12 at 09 37 30
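For illustration, the punctuation loss above typically comes from a normalization step along these lines (a minimal sketch, not the actual Bud500 / GigaSpeech 2 code):

```python
import re

def normalize(text: str) -> str:
    """Typical ASR-pipeline normalization: lowercase and strip punctuation.

    This is the step that discards information -- commas, question marks,
    and sentence boundaries are all gone after it runs.
    """
    text = text.lower()
    # \w matches Unicode word characters in Python 3, so Vietnamese
    # diacritics survive while punctuation is removed
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())
```

e.g. `normalize("Xin chào, bạn khỏe không?")` gives `"xin chào bạn khỏe không"` — exactly the flattened style seen in the screenshots.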

Goal

Improve the current datasets by restoring natural punctuation

Draft solution

  • Use Whisper large v3 to transcribe the audio in those datasets -> whisper_transcription_format

  • Use Llama 3.2 8B, given the ground_truth and the whisper_transcription_format, to reformat the ground_truth

  • Note: we only use the structure (punctuation/casing) of the Whisper large output, not its labels

  • Architecture:

Screenshot 2024-12-12 at 09 48 12
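The reformat step above could be sketched as a prompt like this; the prompt wording and the model call around it are assumptions, not the final pipeline:

```python
def build_reformat_prompt(ground_truth: str, whisper_transcription: str) -> str:
    """Ask the LLM to copy Whisper's punctuation/casing structure onto the
    trusted ground-truth words. Only the *structure* of the Whisper output
    is used; its words are explicitly not to be trusted."""
    return (
        "You are reformatting a Vietnamese transcript.\n"
        "GROUND_TRUTH (correct words, no punctuation):\n"
        f"{ground_truth}\n\n"
        "WHISPER_OUTPUT (may contain wrong words, but has punctuation and casing):\n"
        f"{whisper_transcription}\n\n"
        "Return GROUND_TRUTH with punctuation and casing borrowed from "
        "WHISPER_OUTPUT. Do not add, remove, or change any words."
    )
```

The explicit "do not change any words" constraint matters because of the over-correction behavior noted below.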

Tasklist

  • Transcribe Bud500 and GigaSpeech refined vi using Whisper large v3
  • Set up the distil-label pipeline
  • QA dataset

Need to discuss the pipeline further because:

  • The result from Whisper large v3 is bad for Bud500 audio.
  • Llama tends to complete or modify the label instead of only correcting punctuation/typos.
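One way to catch the second failure mode in the QA step: reject any reformatted label whose word sequence diverges from the ground truth, since the LLM is only supposed to restore punctuation. A minimal sketch (the 0.95 threshold is a guess and should be tuned on a QA sample):

```python
import difflib
import re

def words(text: str) -> list[str]:
    # Compare punctuation- and case-insensitive word sequences
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def accept_label(ground_truth: str, reformatted: str, min_ratio: float = 0.95) -> bool:
    """Reject outputs where the LLM completed or rewrote the sentence
    instead of only restoring punctuation/casing."""
    matcher = difflib.SequenceMatcher(None, words(ground_truth), words(reformatted))
    return matcher.ratio() >= min_ratio
```

A pure punctuation/casing restoration scores 1.0 and passes; a completion that adds extra words drops the ratio and gets filtered out.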