training run: tuning ichigo-quantizer
Goal
Tune to find the best parameters for the Ichigo Quantizer
Hypothesis
- The current default codebook size (512) is too small for multilingual training, and training from scratch requires a large amount of data to converge
- Training on a mixed dataset (English-Vietnamese) maintains the performance of the original WhisperVQ
Task:
- Add code to initialize the codebook embedding from a WhisperVQ checkpoint
- Fix the training loss getting stuck
- Add code to preprocess multiple training datasets (filtered LibriTTS-R + Viet-Bud500)
- Add quantizer inference and fill in the Google Sheet
- Pack short audio (~3s) samples into longer ones (up to 30s)
- Verify KL loss
- Add WER for Vietnamese
- Train on a high-quality Vietnamese dataset (viVoice)
- Refine the Bud500 dataset
- Scale training to large datasets (viVoice, MLS, etc.)
Training Result:
The training results are here: #146

Acknowledgment:
Great work by @tuanlda78202 on adding code to use weights from the WhisperVQ 7-language quantizer checkpoint.
Pull Request:
PR janhq/WhisperSpeech#7 on WhisperSpeech
Impact:
This approach could save significant time compared to training from scratch.
Observation:
The dataset mostly contains sequences of <200 tokens (30s audio), leading to high KL loss.
Data distribution: (histogram omitted)
Testing:
- Adjusted `max_tokens` to 20, 50, and 200.
- Results:
  - At `max_tokens = 20`, KL loss dropped to 4.
  - At `max_tokens = 200`, KL loss > 10.
- Loss test: (plots omitted)
Problem:
Excessive padding tokens during training inflated the loss.
Solution:
Implement dataset packing to reduce padding tokens.
Current PR (WIP): janhq/WhisperSpeech#8
Problem
- The default `WhisperVQ` codebook size (512) may be too small to train effectively on multilingual datasets; training from scratch also requires a lot of data and time to converge to the best loss.
- Training loss is high and plateaus, so the model cannot learn further.
- Training only on new-language datasets may degrade the performance of the English-pretrained VQ checkpoint.
- Training on short audio (~3s) with padding is very inefficient.
- The KL loss may drive the prediction outputs to diverge.
Solution
- Initialize weights from the WhisperVQ-7lang checkpoint (trained on multiple languages, with good results).
- Modify the architecture (codebook size, dimension) or the data pipeline, run experiments, and verify the hypotheses.
- Train on mixed-language datasets.
- Concatenate multiple short audio files into one long file (up to 30s).
- Turn off KL loss during training.
Implementation
- Initialize weights from the `WhisperVQ-7lang` checkpoint: the checkpoint codebook size (512) does not match our increased codebook size (1024), so we copy the first 512 codebook entries from the checkpoint and initialize the remaining 512 by experiment: average embedding, Kaiming init, or duplicating the pretrained codes with random noise (see the sketch after the table below).
- Try different KL-loss weighting factors to check the impact on the main model.
Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
---|---|---|---|---|---|---|---|---|---|
init ckpt dim 1024 - kl 5 | 31m 1s | 42 | 782 | 78.53279 | 1.11499 | 15.48318 | 0.0018883 | 858 | 83.78906 |
init ckpt dim 1024 - kl 2 | 1h 57m 1s | 42 | 2924 | 30.53683 | 0.91597 | 14.80941 | 0.0020393 | 875 | 85.44922 |
init ckpt dim 1024 - kl 1.5 | 1h 58m 7s | 42 | 2939 | 23.34618 | 0.84657 | 14.99841 | 0.0019927 | 886 | 86.52344 |
init ckpt dim 1024 - kl 3 | 2h 43m 4s | 42 | 3726 | 45.72793 | 1.0146 | 14.90375 | 0.0020783 | 868 | 84.76563 |
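A minimal sketch of the codebook-expansion strategies above (hypothetical helper; the function name, noise scale, and cycling scheme are illustrative assumptions, not the exact training code):

```python
import torch

def expand_codebook(old_embed: torch.Tensor, new_size: int, strategy: str = "dup_noise") -> torch.Tensor:
    """Expand a pretrained (512, dim) codebook to (new_size, dim)."""
    old_size, dim = old_embed.shape
    new_embed = torch.empty(new_size, dim)
    new_embed[:old_size] = old_embed  # keep the pretrained codes as-is
    extra = new_size - old_size
    if strategy == "avg":
        # every new slot starts from the mean pretrained embedding
        new_embed[old_size:] = old_embed.mean(dim=0, keepdim=True)
    elif strategy == "kaiming":
        torch.nn.init.kaiming_normal_(new_embed[old_size:])
    elif strategy == "dup_noise":
        # duplicate pretrained codes (cycled) and perturb with small noise
        idx = torch.arange(extra) % old_size
        new_embed[old_size:] = old_embed[idx] + 0.01 * torch.randn(extra, dim)
    else:  # plain random init
        new_embed[old_size:].normal_()
    return new_embed
```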
- Add a `mask logit` before the `softmax` in the `KL loss` to check whether it is a factor in the model's loss (see the masked-KL sketch after the table below).
Name | trainer/global_step | codebook/used_codes_step | codebook/utilization_step | loss/ce_loss_st | loss/commit_loss_step | loss/kl_loss_step | loss/total_train_step |
---|---|---|---|---|---|---|---|
diana_lavenderblush | 3335 | 889 | 86.81641 | 0.29703 | 0.0013738 | 0.3107 | 0.60911 |
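A minimal sketch of excluding padding positions from the KL term, which is our reading of the `mask logit` experiment (hypothetical helper and shapes):

```python
import torch
import torch.nn.functional as F

def masked_kl(student_logits, teacher_logits, valid_mask):
    """KL(teacher || student), averaged over non-padding positions only.

    student_logits, teacher_logits: (B, T, V); valid_mask: (B, T) bool,
    True where the position holds real content rather than padding.
    """
    log_p = F.log_softmax(student_logits, dim=-1)
    q = F.softmax(teacher_logits, dim=-1)
    kl = F.kl_div(log_p, q, reduction="none").sum(-1)  # per-position KL, (B, T)
    return (kl * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```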
- Checked the raw dataset, including audio lengths and the distribution of text tokens. We found that in the original WhisperSpeech implementation the max token value was very high (200) while our data was much shorter (max 20 tokens), which led to more padding and kept the KL loss very high. We verified this hypothesis with different `max_token` values and saw that setting `max_token` to the maximum number of tokens in the dataset returned the lowest loss.
Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
---|---|---|---|---|---|---|---|---|---|
init_ckpt dim 1024 - 512 later random - bs42 - max_token=20 | 7h 32m 19s | 42 | 15099 | 2.75478 | 0.82511 | 1.92655 | 0.00311208 | 968 | 94.53125 |
init_ckpt dim 1024 - 512 later random - bs42 - max_token=50 | 7h 43m 51s | 42 | 15099 | 4.71717 | 0.7786 | 3.93544 | 0.00313314 | 991 | 96.77734 |
init_ckpt dim 1024 - 512 later random - bs8 | 1h 30m 42s | 8 | 9999 | 15.83308 | 0.76369 | 15.06781 | 0.00151839 | 888 | 86.71875 |
init_ckpt dim 1024 - 512 later random - bs42 - max_token=200 | 1h 5m 41s | 42 | 1689 | 15.96285 | 0.83534 | 15.12557 | 0.001936 | 983 | 95.99609 |
init_ckpt dim 1024 - 512 later avg | 3h 21m 52s | 42 | 4999 | 15.39613 | 0.80295 | 14.59104 | 0.00213388 | 865 | 84.47266 |
- Tested with the trained quantizer; it returns better results than `whisper medium` (which produces many hallucinations). Results are in this sheet.
Preview table of comparison
Audio ID | Ground Truth | Trained Quantizer Output | Whisper Output |
---|---|---|---|
audio_0_6 | các bác sĩ có thể chăm sóc người bệnh | Các bác sĩ có thể chăm sóc người bệnh | để các bác sĩ có thể chăm sóc người bệnh. |
audio_0_7 | em bây giờ mới là hiện tại của anh ấy | Em bây giờ mới là hiện tại của anh ấy | Em bây giờ mới là hiện tại của anh ấy |
audio_0_8 | thôi anh đừng nói gì nữa tôi chưa đủ khổ | Thôi anh đừng nói gì nữa, chưa đủ khổ | Thôi anh đừng nói gì nữa. Tôi chưa đủ khổ. |
- Experimented with removing `special_tokens` when encoding the text input; the results were very poor.
Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
---|---|---|---|---|---|---|---|---|---|
no special tokens | 4h 27m 40s | 80 | 6522 | 5.34698 | 3.75046 | 1.59528 | 0.0012379 | 867 | 84.58537 |
- To speed up training, we removed the `WebDataset` implementation in WhisperSpeech and used PyTorch's native `DataLoader`. This reduced CPU data-processing time, pushed GPU utilization to 100%, cut training time from 13h to 5h on a single A6000 GPU, and enabled multi-GPU DDP training.
- Trained on mixed (English + Vietnamese) data: LibriTTS-R (27 GB, 112k train samples) for English and Viet-Bud500 (98 GB, 630k samples) for Vietnamese, with weighted sampling between Vietnamese (70%) and English (30%) applied to the per-batch training distribution (see the sampler sketch after the table below).
Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
---|---|---|---|---|---|---|---|---|---|
init ckpt 1024 - 512 random - bs80 - max_token20 - mix_data | 7952 | 80 | 3191 | N/A | 0.5281 | 1.4206 | 0.0020 | 964 | 94.05% |
init ckpt 2048 - 512 dup noise - bs80 - max_token20 - mix_data | 24916 | 80 | 9338 | 1.5538 | 0.4797 | 1.0716 | 0.0025 | 1928 | 94.09% |
init ckpt 1024 - 512 dup noise - bs80 - max_token20 - mix_data | 57964 | 80 | 23329 | 1.6326 | 0.4719 | 1.1567 | 0.0040 | 1006 | 98.17% |
init ckpt 2048 - 512 random - bs80 - max_token20 - mix_data | 58649 | 80 | 23329 | 1.3866 | 0.4406 | 0.9432 | 0.0028 | 1557 | 75.99% |
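A minimal sketch of the 70/30 per-batch weighting with PyTorch's `WeightedRandomSampler` (dummy stand-in datasets; sizes and batch size are illustrative):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Stand-ins for the real Bud500 / LibriTTS-R datasets.
vi_ds = TensorDataset(torch.randn(630, 16))  # Vietnamese
en_ds = TensorDataset(torch.randn(112, 16))  # English
mixed = ConcatDataset([vi_ds, en_ds])

# Per-sample weights so each batch is ~70% Vietnamese / 30% English,
# regardless of the raw dataset sizes.
weights = torch.tensor([0.7 / len(vi_ds)] * len(vi_ds) + [0.3 / len(en_ds)] * len(en_ds))
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=80, sampler=sampler)
```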
- Concatenated multiple short audio clips into long audio (up to 30s) and set `max_token=200`. This reduced training time ~10x (8 minutes/epoch) on the mixed datasets: concatenating to 30s groups roughly 10 clips into one sample, so the number of training samples shrinks ~10x (a packing sketch follows the table below).
Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
---|---|---|---|---|---|---|---|---|---|
scratch 2048 - largev3 - bs24 - max_token200 - mix_data_73 - 100e - ddp8 | 68056 | 24 | 32263 | 6.444209575653076 | 0.4545549750328064 | 5.978719234466553 | 0.010935629718005655 | 1266.125 | 61.792335510253906 |
init ckpt 2048 - 512 dup_noise - bs42 - max_token200 - mix_data_73 - wo_w_loss - 100e - ddp8 | 50272 | 42 | 24899 | 12.359258651733398 | 0.4022634625434875 | 11.945953369140623 | 0.011041682213544846 | 1974.25 | 96.35187530517578 |
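A minimal sketch of greedy clip packing (hypothetical helper operating on 1-D waveform tensors; the real pipeline presumably also tracks transcript boundaries):

```python
import torch

def pack_audios(clips, sample_rate=16000, max_seconds=30):
    """Greedily pack short waveforms into sequences of up to max_seconds."""
    max_len = max_seconds * sample_rate
    packed, current, current_len = [], [], 0
    for clip in clips:  # each clip is a 1-D waveform tensor
        if current and current_len + len(clip) > max_len:
            packed.append(torch.cat(current))
            current, current_len = [], 0
        current.append(clip)
        current_len += len(clip)
    if current:
        packed.append(torch.cat(current))
    return packed
```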
- Changed Whisper from `medium` to `large-v3`. We found many hallucinated responses, and the mixed dataset led to an English bias during training, resulting in `predictions_loss=5.47` for `large-v3`.
Preview table of comparison
Audio ID | Ground Truth | Trained Quantizer Output | Whisper Output |
---|---|---|---|
audio_0_13 | nơi đây và em thích con người ở đây em | one đây và em thích con người ở đây em | Hãy subscribe cho kênh Ghiền Mì Gõ Để không bỏ lỡ những video hấp dẫn |
audio_0_15 | cước được tắm suối mịn màng trắng sáng | ure được tắm suối miịn màng sắng sáng | Hãy subscribe cho kênh Ghiền Mì Gõ Để không bỏ lỡ những video hấp dẫn |
audio_1_32 | dũng mới cứu được tôi thôi chứ không còn | .ng can cứu được tôi thôi chứ không còn stubborn stubborn stubborn stubborn | Hãy subscribe cho kênh Ghiền Mì Gõ Để không bỏ lỡ những video hấp dẫn |
- Turned off KL loss and resumed weights from phase 1 (trained for 100 epochs). Training result: very good output predictions (better than Whisper `medium`); inference updated at epoch 21 here.
Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
---|---|---|---|---|---|---|---|---|---|
resume concat 100e 2048 - 512 dup_noise - bs42 - max_token200 - mix_data_73 concat true - remove_kl - 100e - ddp8 | 50834 | 42 | 24899 | 0.03272935375571251 | 0.0244855098426342 | 0 | 0.008243842981755733 | 1952.5 | 95.29039001464844 |
Preview table of comparison
Audio ID | Ground Truth | Trained Quantizer Output | Whisper medium Output |
---|---|---|---|
audio_1149 | cảm giác ấm áp anh dành cho tôi anh vào | cảm giác ấm áp anh dành cho tôi anh vào | cảm giác ấm áp anh dành cho tôi |
audio_1150 | rằng những gì đang xảy ra điều trị làm | bằng những gì đang xảy ra điều trị làm | những gì đang xảy ra đều chỉ làm cho tình thường |
audio_1151 | đảo côn lôn thành tỉnh côn sơn trong | làào quôn lôn thành tìnhnh côn sơn trong | là cuốn luôn thành tỉnh cuốn sơn |
audio_1152 | có gì nhiều chỉ có hai mảnh đất mảnh | có gì nhiều chỉ có hai mảnh đất mảnh | empty |
audio_1153 | chỉ đến khi người con thứ của ông vua | chỉ đến khi người con thứ của ông vua | Chỉ đến khi người con thứ của ông vừa vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô |
audio_1154 | ngập tràn nhưng cô vẫn giả vờ làm bộ mặt | ngập tràn nhưng cô vẫn giả vờ làm bộ mặt | nhưng cô vẫn giả vờ làm bộ mặt |
After all experiments, we concluded that the best results come from initializing the codebook from the checkpoint (duplicating the 512 pretrained codes with noise for the remaining entries), concatenating multiple short audio clips up to 30s, training on mixed datasets, and turning off KL loss in the second phase:
Phase 1: KL (distillation) loss + CE loss
Phase 2: CE loss
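A minimal sketch of this two-phase objective (hypothetical helper; keeping the commitment loss in both phases is an assumption based on the logged `loss/commit_loss`):

```python
def total_loss(ce_loss, kl_loss, commit_loss, phase: int, kl_weight: float = 1.0):
    # Phase 1 distills from the Whisper teacher (KL) while learning CE;
    # phase 2 drops the KL term and fine-tunes on CE alone.
    loss = ce_loss + commit_loss  # assumption: commitment loss kept in both phases
    if phase == 1:
        loss = loss + kl_weight * kl_loss
    return loss
```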
Problem
- WER results on both the Vietnamese and English datasets are suspiciously low; the evaluation pipeline needs rechecking.
- Testing only on normalized Vietnamese text (lowercase, no punctuation, etc.) in Bud500 is not representative of real-world applications.
Solution
- Check whether the inference pipeline is autoregressive.
- Test on the viVoice dataset, which has natural transcripts (capitalized text, punctuation, etc.).
Implementation
- Fixed an inference bug: removed `input_toks` from the forward pass and applied autoregressive decoding via `whisper.DecodingOptions()`, as sketched below.
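For reference, a minimal sketch of autoregressive decoding with the standard openai-whisper API (model name, audio path, and language are illustrative; the real pipeline decodes from quantized embeddings rather than a stock Whisper model):

```python
import whisper

model = whisper.load_model("medium")
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))  # illustrative path
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Autoregressive decoding: no ground-truth input_toks are fed to the decoder.
options = whisper.DecodingOptions(language="vi", without_timestamps=True)
result = whisper.decode(model, mel, options)
print(result.text)
```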
WER comparison
Dataset | Language | Trained Quantizer | Pho Whisper | Whisper Medium |
---|---|---|---|---|
Bud500 | Vi | 0.22 | 0.08 | 0.62 |
LibriTTS-R | En | 1.90 | 0.46 | 0.12 |
- Tested checkpoints from phase 1 (100 epochs, with KL loss) and phase 2 (21 epochs, without KL loss).
Preview table of comparison (ckpt phase 2)
Audio ID | Ground Truth | Trained Quantizer Output | PhoWhisper Large Output | Whisper medium Output |
---|---|---|---|---|
audio_10 | Đại tá Trần Đình Hưng, Phó Chỉ huy trưởng, Tham mưu trưởng, Bộ Chỉ huy quân sự tỉnh | đại tá trần định hưng phó chỉ huy trưởng tham mưu trưởng bầu chỉ quy quân sự tỉnh | đại tá trần đình hưng phó chỉ huy trưởng tham mưu trưởng bộ chỉ huy quân sự tỉnh | Đại tá Trần Đình Hưng, phó chỉ quy trưởng, tham mưu trưởng, Bộ Chủ quy quân sự tỉnh |
Preview table of comparison (ckpt phase 1, better)
Audio ID | Ground Truth | Trained Quantizer Output | PhoWhisper Large Output | Whisper medium Output |
---|---|---|---|---|
audio_10 | Đại tá Trần Đình Hưng, Phó Chỉ huy trưởng, Tham mưu trưởng, Bộ Chỉ huy quân sự tỉnh | Đại tá Trần Đình Hưng, Phó Chỉ quy trưởng, Tham mưu trưởng, Bộ Chính quyên quân sự tỉnh | đại tá trần đình hưng phó chỉ huy trưởng tham mưu trưởng bộ chỉ huy quân sự tỉnh | Đại tá Trần Đình Hưng, phó chỉ quy trưởng, tham mưu trưởng, Bộ Chủ quy quân sự tỉnh |
Errors in data sampling
- We should treat clean, finished data like LibriTTS-R separately; there is no need for excess concatenation as below.
- Concatenation should only happen for low-resource languages.
- Reduce the number of training epochs.
- Process the data before running the experiment; AVOID BUILDING DYNAMIC DATASETS ON THE FLY AT ALL COSTS.
Next test run:
- 2 phases:
- Phase 1: KL Loss + CE Loss
- Phase 2: CE Loss
- 10 epochs for each phase
What to validate
- Is the English degradation due to the wrong data sampling (the error above)?
- Are we overtraining with the current 100 epochs?
Only after validating the above points can we move forward with the next steps.
cc @tuanlda78202 TBD today
Problem
- Are we overtraining with the current 100 epochs for both training phases?
- Is the English degradation due to concatenating audio and the sampling distribution being skewed to the Vietnamese language?
Solution
- Reduce the training epochs for both training phases.
- Train on the original (non-concatenated) English dataset and balance the sampling distribution on a per-batch basis to regularize the model.
Implementation
- Train two phases of 10 epochs each (phase 1 with KL loss, phase 2 without, per the table below), weighting the dataset distribution to `0.5` between the concatenated Bud500 and the non-concatenated LibriTTS-R.
Name | Epoch | KL Loss | Val Loss | Val Acc |
---|---|---|---|---|
Phase 1 | 10 | On | 15.56 | 0.84 |
Phase 2 | 10 | Off | 0.21 | 0.94 |
- Inference checkpoints are saved after 10 epochs for both phases and evaluated on viVoice (100 samples) and LibriTTS-R:
  - viVoice, first 100 samples (Vietnamese): (results omitted)
  - LibriTTS-R (English): (results omitted)
Conclusion
- High-quality datasets have an important impact on quantizer performance.
- Overtraining phase 1 with KL loss makes the model generalize better.
- Continuing with a few epochs of phase 2 training (without KL loss) helps the model avoid overfitting.
Training model on high-quality datasets
Problem
- Training the model on low-quality datasets (Bud500) leads to poor performance.
Solution
- Train the model on high-quality datasets (viVoice).
Results
Phase 1 (with KL loss)
Trained on viVoice (868k samples in jan-hq) and LibriTTS-R (112k samples), without the 30s-concatenation dataset, and with early stopping if accuracy does not improve for 10 epochs.
```python
# Implementation of the accuracy metric used for early stopping
def _update_validation_metrics(self, logits, output_toks):
    # count correct predictions over non-ignored (-100) positions only
    valid_toks = output_toks != -100
    self.val_true += (
        (logits.detach().argmax(-1)[valid_toks] == output_toks[valid_toks])
        .float()
        .sum()
    )
    self.val_total += valid_toks.float().sum()

def get_metrics(self):
    metrics = {
        "acc_0": (self.val_true / self.val_total).item(),
    }
    self.val_true[:] = 0
    self.val_total[:] = 0
    return metrics
```
Result on the `val` dataset
Exp ID | Number of samples | Best epoch | Training time | Accuracy | Loss |
---|---|---|---|---|---|
p1-vivoice+librittsr | 10000 | 29 | 2d 12h 29m 37s | 0.89 | 14.59 |
Visualization of loss & accuracy on the `val` phase: (plots omitted)
Testing
Summary results
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER |
---|---|---|---|---|---|---|---|
Ichigo Quantizer | 1 | 10 | Bud500 + LibriTTS-R | LibriTTS-R | En | 4689 | 0.56 |
Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 0.13 |
Ichigo Quantizer | 2 | 100 | Bud500 + LibriTTS-R | LibriTTS-R | En | 4689 | 1.90 |
Ichigo Quantizer | 2 | 10 | Bud500 + LibriTTS-R | LibriTTS-R | En | 4689 | 0.22 |
PhoWhisper Large | - | - | - | LibriTTS-R | En | 4689 | 0.47 |
Whisper Medium | - | - | - | LibriTTS-R | En | 4689 | 0.12 |
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER |
---|---|---|---|---|---|---|---|
Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 0.21 |
PhoWhisper Large | - | - | - | viVoice | Vi | 10000 | 0.23 |
Whisper Medium | - | - | - | viVoice | Vi | 10000 | 0.18 |
- LibriTTS-R: (plots omitted)
- viVoice: (plots omitted)
Phase 2 (without KL loss)
Phase 1 (29e) + Phase 2 (10e)
- viVoice
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER | WER (%) |
---|---|---|---|---|---|---|---|---|
Ichigo Quantizer | 2 | 2 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 0.20 | 20 |
PhoWhisper Large | - | - | - | viVoice | Vi | 10000 | 0.24 | 24 |
Whisper Medium | - | - | - | viVoice | Vi | 10000 | 0.18 | 18 |
Ichigo Quantizer | 2 | 9 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 0.11 | 11 |
PhoWhisper Large | - | - | - | viVoice | Vi | 1000 | 0.23 | 23 |
Whisper Medium | - | - | - | viVoice | Vi | 1000 | 0.18 | 18 |
- LibriTTS-R
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER | WER (%) |
---|---|---|---|---|---|---|---|---|
Ichigo Quantizer | 2 | 2 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 1.23 | 123 |
PhoWhisper Large | - | - | - | LibriTTS-R | En | 4689 | 0.47 | 47 |
Whisper Medium | - | - | - | LibriTTS-R | En | 4689 | 0.12 | 12 |
Ichigo Quantizer | 2 | 9 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 0.70 | 70 |
PhoWhisper Large | - | - | - | LibriTTS-R | En | 1000 | 0.59 | 59 |
Whisper Medium | - | - | - | LibriTTS-R | En | 1000 | 0.12 | 12 |
- Added a prompt to `whisper.DecodingOptions` when testing:

```python
prompt = f"You are a professional transcriber, fluent in {prefix_lang}. You are listening to a recording in which a person is potentially speaking {prefix_lang}, and no other languages. They may have a strong accent. You are to transcribe utterances of {prefix_lang} accordingly"
```
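A hedged usage sketch: `whisper.DecodingOptions` accepts a `prompt` field that conditions the decoder on it as preceding context (`prefix_lang`, model name, and audio path are illustrative):

```python
import whisper

prefix_lang = "Vietnamese"  # illustrative
prompt = f"You are a professional transcriber, fluent in {prefix_lang}. ..."  # full text above

model = whisper.load_model("medium")
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
options = whisper.DecodingOptions(prompt=prompt, language="vi", without_timestamps=True)
print(whisper.decode(model, mel, options).text)
```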
Model Name | Prompt | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER | WER (%) |
---|---|---|---|---|---|---|---|---|---|
Ichigo Quantizer | No | 2 | 9 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 0.93 | 93 |
PhoWhisper Large | No | - | - | - | LibriTTS-R | En | 1000 | 0.59 | 59 |
Whisper Medium | No | - | - | - | LibriTTS-R | En | 1000 | 0.12 | 12 |
Ichigo Quantizer | Yes | 2 | 9 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 0.13 | 13 |
PhoWhisper Large | No | - | - | - | viVoice | Vi | 1000 | 0.23 | 23 |
Whisper Medium | Yes | - | - | - | viVoice | Vi | 1000 | 0.17 | 17 |
Phase 1 full epoch (100e), Phase 2 10e [Ongoing]
Phase 1
- viVoice
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER | WER (%) |
---|---|---|---|---|---|---|---|---|
Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 0.21 | 21 |
PhoWhisper Large | - | - | - | viVoice | Vi | 10000 | 0.23 | 23 |
Whisper Medium | - | - | - | viVoice | Vi | 10000 | 0.18 | 18 |
Ichigo Quantizer | 1 | 62 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 0.18 | 18 |
PhoWhisper Large | - | - | - | viVoice | Vi | 1000 | 0.23 | 23 |
Whisper Medium | - | - | - | viVoice | Vi | 1000 | 0.18 | 18 |
- LibriTTS-R
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER | WER (%) |
---|---|---|---|---|---|---|---|---|
Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 0.13 | 13 |
PhoWhisper Large | - | - | - | LibriTTS-R | En | 4689 | 0.47 | 47 |
Whisper Medium | - | - | - | LibriTTS-R | En | 4689 | 0.12 | 12 |
Ichigo Quantizer | 1 | 62 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 0.13 | 13 |
PhoWhisper Large | - | - | - | LibriTTS-R | En | 1000 | 0.59 | 59 |
Whisper Medium | - | - | - | LibriTTS-R | En | 1000 | 0.12 | 12 |
Bug: Incorrect Mask Generation After Audio Padding
Description
During code review, we discovered that the mask generation after audio padding is incorrectly implemented. The current code creates masks with all 1s for padded audio. Given the padded audio with the padding value = 0:
Input Audio: [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]

```python
# Current problematic code
concatenated_audio = self.pad_audio(concatenated_audio)
mask = torch.zeros(30 * 16000 // 320, dtype=torch.bool)
audio_frames = min(len(concatenated_audio), self.max_audio_length) // 320
mask[:audio_frames] = 1  # Bug: this includes padding frames
```
- The training code reserves a special VQ token for padding (`vq_codes + 1`).
- However, with `mask = 1` everywhere, `~mask` becomes all zeros. Instead of the expected mask `[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]`, @tuanlda78202's code created the buggy mask `[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]`.
- Result: the special padding embedding is never applied, because `~mask` is all zeros.
```python
# Impact on training: this branch never fires for padding frames,
# because ~mask is all zeros.
if self.training and self.config.mask_embs and mask is not None:
    x[~mask] = project_out(self.rq.layers[0]._codebook.embed[0, self.vq_codes])
```
Impact
- The padding token is never trained.
- The model sees raw embeddings in padding regions instead of a consistent padding token, leading to semantic token sequences stuck at the fixed length of 750 when encoding audio.
Next Steps
- Fix mask generation when padding audio.
- Add validation test for mask generation.
This is solved by PR janhq/WhisperSpeech#19.
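For reference, a minimal sketch of the corrected mask generation (capturing the unpadded length before padding is our reading of the fix; the authoritative change is in the PR above):

```python
# Capture the real length *before* padding, so only genuine audio frames
# are marked valid and padded frames keep mask = 0.
unpadded_len = len(concatenated_audio)
concatenated_audio = self.pad_audio(concatenated_audio)
mask = torch.zeros(30 * 16000 // 320, dtype=torch.bool)
audio_frames = min(unpadded_len, self.max_audio_length) // 320
mask[:audio_frames] = 1
```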
[Add PAD tokens] IchigoWhisper Phase 1 50e, Phase 2 20e
Phase 1
- viVoice
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER (%) |
---|---|---|---|---|---|---|---|
IchigoWhisper | 1 | 2 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 25.92 |
IchigoWhisper | 1 | 2 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 23.66 |
IchigoWhisper | 1 | 39 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 22.37 |
IchigoWhisper | 1 | 49 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 20.75 |
PhoWhisper Large 1.55B | - | - | - | viVoice | Vi | 1000 | 23.08 |
Whisper Medium 0.76B | - | - | - | viVoice | Vi | 1000 | 20.45 |
PhoWhisper Large 1.55B | - | - | - | viVoice | Vi | 10000 | 24.00 |
Whisper Medium 0.76B | - | - | - | viVoice | Vi | 10000 | 17.78 |
- LibriTTS-R
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER (%) |
---|---|---|---|---|---|---|---|
IchigoWhisper | 1 | 2 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 16.82 |
IchigoWhisper | 1 | 39 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 18.81 |
IchigoWhisper | 1 | 49 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 17.46 |
PhoWhisper Large 1.55B | - | - | - | LibriTTS-R | En | 4689 | 47.52 |
Whisper Medium 0.76B | - | - | - | LibriTTS-R | En | 4689 | 13.06 |
Phase 2 (Ongoing)
- Resume from phase 1 49e checkpoint
- viVoice
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER (%) |
---|---|---|---|---|---|---|---|
IchigoWhisper | 2 | 5 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 14.46 |
IchigoWhisper | 2 | 20 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 12.91 |
PhoWhisper Large 1.55B | - | - | - | viVoice | Vi | 1000 | 23.08 |
Whisper Medium 0.76B | - | - | - | viVoice | Vi | 1000 | 18.64 |
- LibriTTS-R
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER (%) |
---|---|---|---|---|---|---|---|
IchigoWhisper | 2 | 5 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 16.78 |
IchigoWhisper | 2 | 20 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 17.22 |
PhoWhisper Large 1.55B | - | - | - | LibriTTS-R | En | 1000 | 59.72 |
Whisper Medium 0.76B | - | - | - | LibriTTS-R | En | 1000 | 12.99 |
[Merge Codebooks] IchigoWhisper Phase 1 50e, Phase 2 5e
How to merge?
Codebook Size = 2049 (IchigoWhisper w/ mask token) + 512 (WhisperVQ w/o mask token)
2048 w/ mask first, 512 later
```
# 1. Initial state
Codebook 512:  [512 codes + 1 mask token]
  [C1 C2 C3 ... C512 M]
Codebook 2048: [2048 codes + 1 mask token]
  [D1 D2 D3 ... D2048 M]

# 2. Remove mask token from the 512 codebook
Codebook 512 (without mask):
  [C1 C2 C3 ... C512]        # 512 codes
Codebook 2048 (keeps mask):
  [D1 D2 D3 ... D2048 M]     # 2049 codes

# 3. Create new empty codebook
New size = 512 + 2049 = 2561 codes
  [_ _ _ ... _ _ _]          # 2561 empty slots

# 4. Merge process
Step 1: copy the 2048+mask codes first
  [D1 D2 D3 ... D2048 M | _ _ _ ... _ _ _]
  |-----2049 codes------|-----512 slots----|
Step 2: copy the 512 codes after
  [D1 D2 D3 ... D2048 M | C1 C2 C3 ... C512]
  |-----2049 codes------|-----512 codes----|
```
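A minimal sketch of the merge described above (hypothetical helper; assumes each codebook stores its mask token last):

```python
import torch

def merge_codebooks(embed_2048_with_mask: torch.Tensor, embed_512_with_mask: torch.Tensor) -> torch.Tensor:
    """Merge a (2049, dim) codebook (2048 codes + mask) with a (513, dim) one.

    Drops the mask token from the 512 codebook and appends its codes after
    the 2049 entries, yielding a (2561, dim) codebook.
    """
    codes_512 = embed_512_with_mask[:512]  # drop the trailing mask token
    return torch.cat([embed_2048_with_mask, codes_512], dim=0)
```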
Experiments
- viVoice
Model Name | Phase | Codebook Size | Base VQ | Dataset train | Dataset test | Language Test | Test samples | WER (%) |
---|---|---|---|---|---|---|---|---|
IchigoWhisper | 2 | 2561 | en+pl | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 11.39 |
IchigoWhisper | 2 | 2049 | v3-7lang | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 14.46 |
IchigoWhisper | 2 | 2561 | v3-7lang | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 11.36 |
PhoWhisper Large 1.55B | - | - | - | - | viVoice | Vi | 1000 | 23.08 |
Whisper Medium 0.76B | - | - | - | - | viVoice | Vi | 1000 | 18.64 |
- LibriTTS-R
Model Name | Phase | Codebook Size | Base VQ | Dataset train | Dataset test | Language Test | Test samples | WER (%) |
---|---|---|---|---|---|---|---|---|
IchigoWhisper | 2 | 2561 | en+pl | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 12.96 |
IchigoWhisper | 2 | 2049 | v3-7lang | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 16.78 |
IchigoWhisper | 2 | 2561 | v3-7lang | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 13.01 |
PhoWhisper Large 1.55B | - | - | - | - | LibriTTS-R | En | 1000 | 59.72 |
Whisper Medium 0.76B | - | - | - | - | LibriTTS-R | En | 1000 | 12.99 |