chore: Augmenting current Vietnamese speech dataset
hahuyhoang411 commented
Problem
Most current Vietnamese speech datasets are built with the same pipeline (YouTube -> VAD -> Whisper -> Normalization). This pipeline strips all punctuation from the transcriptions, which loses information.
e.g. Bud500 or GigaSpeech2
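To make the information loss concrete, here is a minimal sketch of a normalization step of the kind described above. The exact normalization used by these datasets is not stated in this issue, so the `normalize` function below (lowercasing and punctuation removal) is an assumption for illustration only.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical normalization: lowercase the text and drop punctuation."""
    # Compose characters first so Vietnamese diacritics stay attached to letters.
    text = unicodedata.normalize("NFC", text)
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Lowercase and collapse repeated whitespace.
    return " ".join(text.lower().split())


raw = "Xin chào, bạn khỏe không?"
print(normalize(raw))  # the comma and question mark are gone
```

After this step the sentence boundary and the question mark cannot be recovered from the label alone, which is the gap this issue is about.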
Goal
Improve the current datasets by adding back natural punctuation
Draft solution
- Use Whisper large v3 to transcribe the audio in those datasets -> `whisper_transcription_format`
- Use Llama3.2 8B with the `ground_truth` and the `whisper_transcription_format` to reformat the `ground_truth`
- Note: We only use the structure of the Whisper large output, not its labels
- Architecture:
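The reformatting step in the draft solution could be driven by a prompt along these lines. This is only a sketch: the function name `build_reformat_prompt` and the prompt wording are assumptions, not the prompt actually used, and the call to the model itself is left out.

```python
def build_reformat_prompt(ground_truth: str, whisper_transcription_format: str) -> str:
    """Build a hypothetical prompt asking the LLM to punctuate ground_truth.

    The Whisper output is passed only as a structural hint (where sentences
    and clauses end); its words are not meant to replace the ground truth.
    """
    return (
        "You are given two transcriptions of the same Vietnamese audio clip.\n"
        "GROUND_TRUTH has the correct words but no punctuation.\n"
        "WHISPER has punctuation, but its words may be wrong.\n"
        "Rewrite GROUND_TRUTH with punctuation, using WHISPER only as a guide "
        "for sentence structure.\n"
        "Do not add, remove, or change any word of GROUND_TRUTH.\n\n"
        f"GROUND_TRUTH: {ground_truth}\n"
        f"WHISPER: {whisper_transcription_format}\n"
        "OUTPUT:"
    )


prompt = build_reformat_prompt("xin chào bạn khỏe không", "Xin chào, bạn khỏe không?")
print(prompt)
```

Keeping the two inputs clearly labeled in the prompt is one way to express the note above, that only the structure of the Whisper output is used and not its label.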
Tasklist
- Transcribe Bud500 and GigaSpeech refined (vi) using Whisper large v3
- Set up distil label pipeline
- QA dataset
bachvudinh commented
We need to discuss the pipeline further because:
- The results from Whisper large v3 are bad for Bud500 audio.
- Llama tends to complete or modify the label instead of only correcting typos.
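One possible safeguard against the second point is to reject any LLM output whose words differ from the original label. The sketch below is an assumption about how such a check could look, not part of the proposed pipeline; `words_only` and `preserves_words` are hypothetical helper names. Note that it also rejects legitimate typo corrections, so it only fits if the LLM is restricted to adding punctuation.

```python
import re
import unicodedata


def words_only(text: str) -> list[str]:
    """Lowercase the text, drop punctuation, and return the word sequence."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"[^\w\s]", " ", text).split()


def preserves_words(ground_truth: str, reformatted: str) -> bool:
    """True if the reformatted label has exactly the ground-truth words, in order."""
    return words_only(ground_truth) == words_only(reformatted)


# Punctuation and casing added, words unchanged: accepted.
print(preserves_words("xin chào bạn khỏe không", "Xin chào, bạn khỏe không?"))
# The model appended words that are not in the label: rejected.
print(preserves_words("xin chào bạn", "Xin chào, bạn khỏe không?"))
```

Samples that fail the check could be kept with their original label or sent to manual QA. Tokens that contain punctuation internally, such as decimal numbers, are split by this simple tokenizer on both sides of the comparison.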