# janhq/ichigo · training run: tuning ichigo-quantizer

Goal

Tuning to find the best hyperparameters for the Ichigo Quantizer.

Hypothesis

  • The current default codebook size (512) is too small for multilingual data, and training from scratch needs a lot of data to converge.
  • Training on a mixed dataset (En-Vi) maintains the performance of the original WhisperVQ.

Tasks:

  • Add code to init the codebook embedding from a WhisperVQ checkpoint
  • Fix the training loss getting stuck
  • Add code to preprocess multiple training datasets (filtered LibriTTS-R + Viet-Bud500)
  • Add quantizer inference and fill in the Google Sheet
  • Pack short audio (~3s) samples (up to 30s)
  • Verify the KL loss
  • Add WER for Vietnamese
  • Train on a high-quality Vietnamese dataset (viVoice)
  • Refine the Bud500 dataset
  • Scale training to large datasets (viVoice, MLS, etc.)

Training Result:

Training results are tracked in #146.

  • Observation:
    The dataset mostly contains sequences shorter than 200 tokens (the budget for 30s audio), leading to a high KL loss.
    Data distribution:
    [image: distribution of sequence lengths]

  • Testing:

    • Adjusted max tokens to 20, 50, and 200.
    • Results:
      • At max tokens = 20, KL loss dropped to 4.
      • At max tokens = 200, KL loss > 10.
      • Loss test:
        [screenshot: loss curves]
  • Problem:
    Excessive padding tokens during training inflated the loss.

  • Solution:
    Implement dataset packing to reduce padding tokens (minimal sketch below).
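
A minimal sketch of the packing idea (the helper name and structure are hypothetical, not the repo's actual code): greedily group short clips so each packed example approaches, but never exceeds, 30s of 16 kHz audio.

```python
# Hypothetical packing helper: greedily fill 30s bins with short clips.
SAMPLE_RATE = 16_000
MAX_SAMPLES = 30 * SAMPLE_RATE

def pack_clips(clips):
    """clips: iterable of 1-D waveforms; returns groups of clips to concatenate."""
    packs, current, current_len = [], [], 0
    for clip in clips:
        # Close the current pack if this clip would overflow the 30s budget
        if current and current_len + len(clip) > MAX_SAMPLES:
            packs.append(current)
            current, current_len = [], 0
        current.append(clip)
        current_len += len(clip)
    if current:
        packs.append(current)
    return packs
```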

Current PR (WIP): janhq/WhisperSpeech#8

Problem

  • The default WhisperVQ codebook size (512) is small and may not be effective for training on multilingual datasets; training from scratch also requires a lot of data and time to converge to the best loss.
  • The training loss is high and gets stuck at a plateau; the model cannot learn further.
  • Training only on new-language datasets may degrade the performance of the English-pretrained VQ checkpoint.
  • Training on short audio (~3s) with padding is very inefficient.
  • The KL loss may cause the prediction outputs to diverge.

Solution

  • Init weights from a WhisperVQ-7lang checkpoint (trained on multiple languages, with good results).
  • Modify the architecture (codebook size, dim) or the data pipeline, run experiments, and verify the hypotheses.
  • Train on mixed-language datasets.
  • Concatenate multiple short audio files into a long (30s) audio file.
  • Turn off the KL loss during training.

Implementation

  1. Init weights from a WhisperVQ-7lang checkpoint: the WhisperVQ checkpoint's codebook size (512) mismatches our increased codebook size (1024), so we filled the first 512 rows of the model with the pretrained weights and experimented with filling the remaining rows using the average embedding, Kaiming init, or a duplicate of the first 512 with random noise (see the sketch below).
     [image]
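
A minimal sketch of the three init strategies (the tensor shapes and the 0.01 noise scale are illustrative assumptions):

```python
import torch

# Stand-in for the 512-entry pretrained WhisperVQ codebook (embedding dim assumed)
pretrained = torch.randn(512, 64)
new_codebook = torch.empty(1024, 64)

new_codebook[:512] = pretrained  # reuse the pretrained codes as-is

strategy = "dup_noise"
if strategy == "dup_noise":
    # Duplicate the pretrained codes plus small random noise
    new_codebook[512:] = pretrained + 0.01 * torch.randn_like(pretrained)
elif strategy == "avg":
    # Fill every new slot with the average pretrained embedding
    new_codebook[512:] = pretrained.mean(dim=0, keepdim=True)
elif strategy == "kaiming":
    # Kaiming-normal init for the new slots
    torch.nn.init.kaiming_normal_(new_codebook[512:])
```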
  2. Try different KL loss weighting factors to check the impact on the main model.

| Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
|---|---|---|---|---|---|---|---|---|---|
| init ckpt dim 1024 - kl 5 | 31m 1s | 42 | 782 | 78.53279 | 1.11499 | 15.48318 | 0.0018883 | 858 | 83.78906 |
| init ckpt dim 1024 - kl 2 | 1h 57m 1s | 42 | 2924 | 30.53683 | 0.91597 | 14.80941 | 0.0020393 | 875 | 85.44922 |
| init ckpt dim 1024 - kl 1.5 | 1h 58m 7s | 42 | 2939 | 23.34618 | 0.84657 | 14.99841 | 0.0019927 | 886 | 86.52344 |
| init ckpt dim 1024 - kl 3 | 2h 43m 4s | 42 | 3726 | 45.72793 | 1.0146 | 14.90375 | 0.0020783 | 868 | 84.76563 |
  3. Mask the logits before the softmax in the KL loss to check whether it is a factor in the model's loss (one plausible reading is sketched after the table).

| Name | trainer/global_step | codebook/used_codes_step | codebook/utilization_step | loss/ce_loss_step | loss/commit_loss_step | loss/kl_loss_step | loss/total_train_step |
|---|---|---|---|---|---|---|---|
| diana_lavenderblush | 3335 | 889 | 86.81641 | 0.29703 | 0.0013738 | 0.3107 | 0.60911 |
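
One plausible reading, as a minimal sketch (the function and the choice to mask by position are assumptions, not the repo's actual code): drop padded positions from the KL (distillation) term so they cannot inflate it.

```python
import torch
import torch.nn.functional as F

def masked_kl(student_logits, teacher_logits, pad_mask):
    """pad_mask: True at real (non-padding) positions, shape (B, T)."""
    log_p = F.log_softmax(student_logits, dim=-1)
    q = F.softmax(teacher_logits, dim=-1)
    kl = F.kl_div(log_p, q, reduction="none").sum(-1)  # per-position KL, (B, T)
    # Average only over real positions; padding contributes nothing
    return (kl * pad_mask).sum() / pad_mask.sum()
```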
  4. Check the raw dataset, including audio lengths and the distribution of text tokens. We found that the original WhisperSpeech implementation used a very high max-token value (200) while our data was much shorter (max 20 tokens), which led to heavy padding and kept the KL loss very high. We verified this hypothesis with different max_token values and saw that setting max_token equal to the dataset's actual maximum token count gave the lowest loss.

| Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
|---|---|---|---|---|---|---|---|---|---|
| init_ckpt dim 1024 - 512 later random - bs42 - max_token=20 | 7h 32m 19s | 42 | 15099 | 2.75478 | 0.82511 | 1.92655 | 0.00311208 | 968 | 94.53125 |
| init_ckpt dim 1024 - 512 later random - bs42 - max_token=50 | 7h 43m 51s | 42 | 15099 | 4.71717 | 0.7786 | 3.93544 | 0.00313314 | 991 | 96.77734 |
| init_ckpt dim 1024 - 512 later random - bs8 | 1h 30m 42s | 8 | 9999 | 15.83308 | 0.76369 | 15.06781 | 0.00151839 | 888 | 86.71875 |
| init_ckpt dim 1024 - 512 later random - bs42 - max_token=200 | 1h 5m 41s | 42 | 1689 | 15.96285 | 0.83534 | 15.12557 | 0.001936 | 983 | 95.99609 |
| init_ckpt dim 1024 - 512 later avg | 3h 21m 52s | 42 | 4999 | 15.39613 | 0.80295 | 14.59104 | 0.00213388 | 865 | 84.47266 |
  5. Test with the trained quantizer; it returns better results than Whisper medium (which hallucinates heavily). Full results are in this sheet.

Preview table of comparison

| Audio ID | Ground Truth | Trained Quantizer Output | Whisper Output |
|---|---|---|---|
| audio_0_6 | các bác sĩ có thể chăm sóc người bệnh | Các bác sĩ có thể chăm sóc người bệnh | để các bác sĩ có thể chăm sóc người bệnh. |
| audio_0_7 | em bây giờ mới là hiện tại của anh ấy | Em bây giờ mới là hiện tại của anh ấy | Em bây giờ mới là hiện tại của anh ấy |
| audio_0_8 | thôi anh đừng nói gì nữa tôi chưa đủ khổ | Thôi anh đừng nói gì nữa, chưa đủ khổ | Thôi anh đừng nói gì nữa. Tôi chưa đủ khổ. |
  6. Experiment with removing special_tokens when encoding the text input; the result is very bad.

| Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
|---|---|---|---|---|---|---|---|---|---|
| no special tokens | 4h 27m 40s | 80 | 6522 | 5.34698 | 3.75046 | 1.59528 | 0.0012379 | 867 | 84.58537 |
  7. To speed up training, we removed the WebDataset implementation in WhisperSpeech and used PyTorch's native DataLoader. This cut CPU data-processing time, pushed GPU utilization to 100%, reduced training time from 13h to 5h on a single A6000 GPU, and enabled multi-GPU DDP training (minimal sketch below).
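
A minimal sketch of the swap (a dummy TensorDataset stands in for the real preprocessed dataset; the worker counts are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the preprocessed training set (e.g., log-mel tensors)
train_dataset = TensorDataset(torch.randn(1_000, 80, 3000))

train_loader = DataLoader(
    train_dataset,
    batch_size=42,
    shuffle=True,
    num_workers=8,            # parallel CPU preprocessing keeps the GPU fed
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # avoid re-spawning workers every epoch
)
# For DDP, replace shuffle=True with a torch.utils.data.DistributedSampler
# so each rank sees a distinct shard of the data.
```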

  8. Train on mixed (English + Vietnamese) data: LibriTTS-R (27GB, 112k train samples) for English and Viet-Bud500 (98GB, 630k samples) for Vietnamese, with weighted sampling between the Vietnamese (70%) and English (30%) datasets applied to the per-batch training distribution (sampler sketch after the table).

| Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
|---|---|---|---|---|---|---|---|---|---|
| init ckpt 1024 - 512 random - bs80 - max_token20 - mix_data | 7952 | 80 | 3191 | N/A | 0.5281 | 1.4206 | 0.0020 | 964 | 94.05% |
| init ckpt 2048 - 512 dup noise - bs80 - max_token20 - mix_data | 24916 | 80 | 9338 | 1.5538 | 0.4797 | 1.0716 | 0.0025 | 1928 | 94.09% |
| init ckpt 1024 - 512 dup noise - bs80 - max_token20 - mix_data | 57964 | 80 | 23329 | 1.6326 | 0.4719 | 1.1567 | 0.0040 | 1006 | 98.17% |
| init ckpt 2048 - 512 random - bs80 - max_token20 - mix_data | 58649 | 80 | 23329 | 1.3866 | 0.4406 | 0.9432 | 0.0028 | 1557 | 75.99% |
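
A minimal sketch of the 70/30 per-batch mix with WeightedRandomSampler (the dataset objects are stand-ins; sizes are taken from the sample counts above):

```python
import torch
from torch.utils.data import WeightedRandomSampler

vi_len, en_len = 630_000, 112_000  # Viet-Bud500 / LibriTTS-R sample counts
# Per-sample weights: the Vietnamese pool gets 70% of the total sampling mass,
# the English pool 30%, so batches are ~70/30 Vi/En in expectation.
weights = torch.cat([
    torch.full((vi_len,), 0.7 / vi_len),
    torch.full((en_len,), 0.3 / en_len),
])
sampler = WeightedRandomSampler(weights, num_samples=vi_len + en_len)
# loader = DataLoader(ConcatDataset([vi_ds, en_ds]), batch_size=80, sampler=sampler)
```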
  9. Concatenate multiple short audios into one long (30s) audio and set max_token=200. This reduced training time ~10x (8 minutes/epoch) on the mixed datasets, because concatenating to 30s groups roughly 10 samples together, cutting the number of training samples ~10x.

| Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
|---|---|---|---|---|---|---|---|---|---|
| scratch 2048 - largev3 - bs24 - max_token200 - mix_data_73 - 100e - ddp8 | 68056 | 24 | 32263 | 6.44421 | 0.45455 | 5.97872 | 0.01094 | 1266.125 | 61.79234 |
| init ckpt 2048 - 512 dup_noise - bs42 - max_token200 - mix_data_73 - wo_w_loss - 100e - ddp8 | 50272 | 42 | 24899 | 12.35926 | 0.40226 | 11.94595 | 0.01104 | 1974.25 | 96.35188 |
  10. Change Whisper from medium to large-v3. We found many hallucinated responses, and the mixed dataset led to an English bias during training, resulting in predictions_loss = 5.47 with large-v3.

Preview table of comparison

| Audio ID | Ground Truth | Trained Quantizer Output | Whisper Output |
|---|---|---|---|
| audio_0_13 | nơi đây và em thích con người ở đây em | one đây và em thích con người ở đây em | Hãy subscribe cho kênh Ghiền Mì Gõ Để không bỏ lỡ những video hấp dẫn |
| audio_0_15 | cước được tắm suối mịn màng trắng sáng | ure được tắm suối miịn màng sắng sáng | Hãy subscribe cho kênh Ghiền Mì Gõ Để không bỏ lỡ những video hấp dẫn |
| audio_1_32 | dũng mới cứu được tôi thôi chứ không còn | .ng can cứu được tôi thôi chứ không còn stubborn stubborn stubborn stubborn | Hãy subscribe cho kênh Ghiền Mì Gõ Để không bỏ lỡ những video hấp dẫn |
  11. Turn off the KL loss and resume weights from phase 1 (trained 100e). Training result: very good output predictions (better than Whisper medium); inference updated at epoch 21 here.

| Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
|---|---|---|---|---|---|---|---|---|---|
| resume concat 100e 2048 - 512 dup_noise - bs42 - max_token200 - mix_data_73 concat true - remove_kl - 100e - ddp8 | 50834 | 42 | 24899 | 0.03273 | 0.02449 | 0 | 0.00824 | 1952.5 | 95.29039 |

Preview table of comparison

| Audio ID | Ground Truth | Trained Quantizer Output | Whisper Medium Output |
|---|---|---|---|
| audio_1149 | cảm giác ấm áp anh dành cho tôi anh vào | cảm giác ấm áp anh dành cho tôi anh vào | cảm giác ấm áp anh dành cho tôi |
| audio_1150 | rằng những gì đang xảy ra điều trị làm | bằng những gì đang xảy ra điều trị làm | những gì đang xảy ra đều chỉ làm cho tình thường |
| audio_1151 | đảo côn lôn thành tỉnh côn sơn trong là | làào quôn lôn thành tìnhnh côn sơn trong là | cuốn luôn thành tỉnh cuốn sơn |
| audio_1152 | có gì nhiều chỉ có hai mảnh đất mảnh | có gì nhiều chỉ có hai mảnh đất mảnh | empty |
| audio_1153 | chỉ đến khi người con thứ của ông vua | chỉ đến khi người con thứ của ông vua | Chỉ đến khi người con thứ của ông vừa vô cùng vô cùng vô cùng… ("vô cùng" repeated to the length limit) |
| audio_1154 | ngập tràn nhưng cô vẫn giả vờ làm bộ mặt | ngập tràn nhưng cô vẫn giả vờ làm bộ mặt | nhưng cô vẫn giả vờ làm bộ mặt |

After all experiments, we concluded that the best results come from: initializing from the 512-code checkpoint and duplicating it with noise for the remaining codebook entries, concatenating multiple short audios up to 30s, training on mixed datasets, and turning off the KL loss in phase 2.

Phase 1: KL (distillation) loss + CE loss
Phase 2: CE loss only
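
A minimal sketch of this two-phase loss schedule (the function shape and KL weight are illustrative assumptions; the commit loss comes from the VQ layer):

```python
import torch.nn.functional as F

def total_loss(logits, targets, teacher_logits, commit_loss, phase, kl_weight=1.0):
    # CE against the text targets; -100 marks padded positions
    ce = F.cross_entropy(logits.transpose(-1, -2), targets, ignore_index=-100)
    if phase == 1:
        # Distill from the teacher distribution in phase 1 only
        kl = F.kl_div(
            F.log_softmax(logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
        return ce + kl_weight * kl + commit_loss
    return ce + commit_loss  # phase 2: CE (+ commit) only
```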

Problem

  • Suspiciously low WER on both the Vietnamese and English datasets; the evaluation pipeline needs rechecking (it may not be decoding autoregressively).
  • Testing only on normalized Vietnamese text (lowercase, no punctuation, etc.) from Bud500 is not representative of real-world applications.

Solution

  • Check whether the inference pipeline is autoregressive.
  • Test on the viVoice dataset, which has natural transcripts (capitalized text, punctuation, etc.).

Implementation

  • Fix the inference bug: remove input_toks from the forward pass and decode autoregressively with whisper.DecodingOptions() (sketch below).
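
A minimal sketch of the autoregressive evaluation path with openai-whisper (the silent stand-in clip is illustrative; in the real pipeline the dequantized features feed the decoder):

```python
import torch
import whisper

model = whisper.load_model("medium")
audio = whisper.pad_or_trim(torch.zeros(16_000 * 5))  # stand-in 5s clip
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# No input_toks are fed in; whisper.decode generates tokens autoregressively
options = whisper.DecodingOptions(language="vi", without_timestamps=True)
result = whisper.decode(model, mel, options)
print(result.text)
```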

WER comparison

| Dataset | Language | Trained Quantizer | PhoWhisper | Whisper Medium |
|---|---|---|---|---|
| Bud500 | Vi | 0.22 | 0.08 | 0.62 |
| LibriTTS-R | En | 1.90 | 0.46 | 0.12 |
  • Test the checkpoints from phase 1 (100e, with KL loss) and phase 2 (21e, without KL loss) training.
    [image]

Preview table of comparison (ckpt phase 2)

| Audio ID | Ground Truth | Trained Quantizer Output | PhoWhisper Large Output | Whisper Medium Output |
|---|---|---|---|---|
| audio_10 | Đại tá Trần Đình Hưng, Phó Chỉ huy trưởng, Tham mưu trưởng, Bộ Chỉ huy quân sự tỉnh | đại tá trần định hưng phó chỉ huy trưởng tham mưu trưởng bầu chỉ quy quân sự tỉnh | đại tá trần đình hưng phó chỉ huy trưởng tham mưu trưởng bộ chỉ huy quân sự tỉnh | Đại tá Trần Đình Hưng, phó chỉ quy trưởng, tham mưu trưởng, Bộ Chủ quy quân sự tỉnh |

Preview table of comparison (ckpt phase 1, better)

| Audio ID | Ground Truth | Trained Quantizer Output | PhoWhisper Large Output | Whisper Medium Output |
|---|---|---|---|---|
| audio_10 | Đại tá Trần Đình Hưng, Phó Chỉ huy trưởng, Tham mưu trưởng, Bộ Chỉ huy quân sự tỉnh | Đại tá Trần Đình Hưng, Phó Chỉ quy trưởng, Tham mưu trưởng, Bộ Chính quyên quân sự tỉnh | đại tá trần đình hưng phó chỉ huy trưởng tham mưu trưởng bộ chỉ huy quân sự tỉnh | Đại tá Trần Đình Hưng, phó chỉ quy trưởng, tham mưu trưởng, Bộ Chủ quy quân sự tỉnh |

Errors in data sampling

  • We should treat clean, finished data like LibriTTS-R separately; there is no need for the excess concatenation shown below.
  • Concatenation should only happen for low-resource languages.
  • Reduce the number of training epochs.
  • Process the data before running the experiment; AVOID BUILDING DYNAMIC DATASETS ON THE FLY AT ALL COSTS.

[image: example of excess concatenation in the sampled data]

Next test run:

  • Two phases:
    • Phase 1: KL loss + CE loss
    • Phase 2: CE loss
  • 10 epochs for each phase

What to validate

  • Is the English degradation due to the wrong data sampling (the error above)?
  • Are we overtraining with the current 100 epochs?

Only after validating the above points can we move forward with the next steps.

cc @tuanlda78202 TBD today

Problem

  1. Are we overtraining with the current 100 epochs for both training phases?
  2. Is the English degradation due to concatenating audio and the sampling distribution being skewed to the Vietnamese language?

Solution

  1. Reduce the training epochs for both training phases.
  2. Train on the original English (non-concatenated) dataset and balance the sampling distribution on a per-batch basis to regularize the model.

Implementation

  1. Train two phases, each with 10 epochs. Phase 1 keeps the KL loss on (phase 2 turns it off), and the dataset distribution is weighted 0.5/0.5 between concatenated Bud500 and non-concatenated LibriTTS-R.

| Name | Epochs | KL Loss | Val Loss | Val Acc |
|---|---|---|---|---|
| Phase 1 | 10 | On | 15.56 | 0.84 |
| Phase 2 | 10 | Off | 0.21 | 0.94 |
  2. Run inference with the checkpoints saved after 10 epochs of both phases on viVoice (100 samples) and LibriTTS-R.

[image: viVoice first 100 samples (Vietnamese)]
[image: LibriTTS-R (English)]

Conclusion

  • High-quality datasets have a significant impact on quantizer performance.
  • Overtraining phase 1 with KL loss helps the model generalize.
  • Continuing phase 2 without KL loss for a few epochs helps the model avoid overfitting.

Training model on high-quality datasets

Problem

  • Training the model on low-quality datasets (Bud500) leads to poor performance.

Solution

  • Train the model on high-quality datasets (viVoice).

Results

Phase 1 (with KL loss)

Training on viVoice (868k samples, from jan-hq) and LibriTTS-R (112k samples), without the 30s concatenation, with early stopping if validation accuracy does not improve for 10 epochs. The accuracy metric:

```python
# Implementation of the accuracy metric used for early stopping
def _update_validation_metrics(self, logits, output_toks):
    # Positions labeled -100 are padding / masked out and are ignored
    valid_toks = output_toks != -100
    # Count correct top-1 predictions over the valid positions
    self.val_true += (
        (logits.detach().argmax(-1)[valid_toks] == output_toks[valid_toks])
        .float()
        .sum()
    )
    self.val_total += valid_toks.float().sum()

def get_metrics(self):
    metrics = {
        "acc_0": (self.val_true / self.val_total).item(),
    }
    # Reset the running counters for the next validation epoch
    self.val_true[:] = 0
    self.val_total[:] = 0
    return metrics
```
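
Wiring `acc_0` into early stopping could look like this, assuming a PyTorch Lightning trainer (a sketch, not necessarily the repo's exact setup):

```python
from pytorch_lightning.callbacks import EarlyStopping

# Stop when validation accuracy fails to improve for 10 consecutive epochs
early_stop = EarlyStopping(monitor="acc_0", mode="max", patience=10)
# trainer = pytorch_lightning.Trainer(callbacks=[early_stop], ...)
```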

Result on val dataset

| Exp ID | Number of samples | Best epoch | Training time | Accuracy | Loss |
|---|---|---|---|---|---|
| p1-vivoice+librittsr | 10000 | 29 | 2d 12h 29m 37s | 0.89 | 14.59 |

Visualization of loss & accuracy on the validation phase

[image: val accuracy per epoch]
[image: val loss per epoch]

Testing

Summary results

| Model Name | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER |
|---|---|---|---|---|---|---|---|
| Ichigo Quantizer | 1 | 10 | Bud500 + LibriTTS-R | LibriTTS-R | En | 4689 | 0.56 |
| Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 0.13 |
| Ichigo Quantizer | 2 | 100 | Bud500 + LibriTTS-R | LibriTTS-R | En | 4689 | 1.90 |
| Ichigo Quantizer | 2 | 10 | Bud500 + LibriTTS-R | LibriTTS-R | En | 4689 | 0.22 |
| PhoWhisper Large | - | - | - | LibriTTS-R | En | 4689 | 0.47 |
| Whisper Medium | - | - | - | LibriTTS-R | En | 4689 | 0.12 |

| Model Name | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER |
|---|---|---|---|---|---|---|---|
| Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 0.21 |
| PhoWhisper Large | - | - | - | viVoice | Vi | 10000 | 0.23 |
| Whisper Medium | - | - | - | viVoice | Vi | 10000 | 0.18 |
  1. LibriTTS-R

[image: LibriTTS-R (English)]
[image: preview of some predictions on LibriTTS-R (English)]

  2. viVoice

[image: viVoice (Vietnamese)]
[image: preview of some predictions on viVoice (Vietnamese)]

Phase 2 (without KL loss)

Phase 1: 29 epochs; Phase 2: 10 epochs.

  1. viVoice

| Model Name | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER | WER (%) |
|---|---|---|---|---|---|---|---|---|
| Ichigo Quantizer | 2 | 2 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 0.20 | 20 |
| PhoWhisper Large | - | - | - | viVoice | Vi | 10000 | 0.24 | 24 |
| Whisper Medium | - | - | - | viVoice | Vi | 10000 | 0.18 | 18 |
| Ichigo Quantizer | 2 | 9 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 0.11 | 11 |
| PhoWhisper Large | - | - | - | viVoice | Vi | 1000 | 0.23 | 23 |
| Whisper Medium | - | - | - | viVoice | Vi | 1000 | 0.18 | 18 |
  2. LibriTTS-R

| Model Name | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER | WER (%) |
|---|---|---|---|---|---|---|---|---|
| Ichigo Quantizer | 2 | 2 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 1.23 | 123 |
| PhoWhisper Large | - | - | - | LibriTTS-R | En | 4689 | 0.47 | 47 |
| Whisper Medium | - | - | - | LibriTTS-R | En | 4689 | 0.12 | 12 |
| Ichigo Quantizer | 2 | 9 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 0.70 | 70 |
| PhoWhisper Large | - | - | - | LibriTTS-R | En | 1000 | 0.59 | 59 |
| Whisper Medium | - | - | - | LibriTTS-R | En | 1000 | 0.12 | 12 |
  3. Add a prompt to whisper.DecodingOptions when testing (sketch after the table):

```python
prompt = f"You are a professional transcriber, fluent in {prefix_lang}. You are listening to a recording in which a person is potentially speaking {prefix_lang}, and no other languages. They may have a strong accent. You are to transcribe utterances of {prefix_lang} accordingly"
```

| Model Name | Prompt | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER | WER (%) |
|---|---|---|---|---|---|---|---|---|---|
| Ichigo Quantizer | No | 2 | 9 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 0.93 | 93 |
| PhoWhisper Large | No | - | - | - | LibriTTS-R | En | 1000 | 0.59 | 59 |
| Whisper Medium | No | - | - | - | LibriTTS-R | En | 1000 | 0.12 | 12 |
| Ichigo Quantizer | Yes | 2 | 9 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 0.13 | 13 |
| PhoWhisper Large | No | - | - | - | viVoice | Vi | 1000 | 0.23 | 23 |
| Whisper Medium | Yes | - | - | - | viVoice | Vi | 1000 | 0.17 | 17 |
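
A sketch of how the prompt is passed at decode time (openai-whisper's DecodingOptions accepts a `prompt` field; the language code here is illustrative):

```python
import whisper

prefix_lang = "Vietnamese"
prompt = (
    f"You are a professional transcriber, fluent in {prefix_lang}. "
    f"You are listening to a recording in which a person is potentially "
    f"speaking {prefix_lang}, and no other languages. They may have a strong "
    f"accent. You are to transcribe utterances of {prefix_lang} accordingly"
)
options = whisper.DecodingOptions(
    language="vi", prompt=prompt, without_timestamps=True
)
```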

Phase 1 full run (100e), Phase 2 10e [Ongoing]

Phase 1

  1. viVoice

| Model Name | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER | WER (%) |
|---|---|---|---|---|---|---|---|---|
| Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 0.21 | 21 |
| PhoWhisper Large | - | - | - | viVoice | Vi | 10000 | 0.23 | 23 |
| Whisper Medium | - | - | - | viVoice | Vi | 10000 | 0.18 | 18 |
| Ichigo Quantizer | 1 | 62 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 0.18 | 18 |
| PhoWhisper Large | - | - | - | viVoice | Vi | 1000 | 0.23 | 23 |
| Whisper Medium | - | - | - | viVoice | Vi | 1000 | 0.18 | 18 |
  2. LibriTTS-R

| Model Name | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER | WER (%) |
|---|---|---|---|---|---|---|---|---|
| Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 0.13 | 13 |
| PhoWhisper Large | - | - | - | LibriTTS-R | En | 4689 | 0.47 | 47 |
| Whisper Medium | - | - | - | LibriTTS-R | En | 4689 | 0.12 | 12 |
| Ichigo Quantizer | 1 | 62 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 0.13 | 13 |
| PhoWhisper Large | - | - | - | LibriTTS-R | En | 1000 | 0.59 | 59 |
| Whisper Medium | - | - | - | LibriTTS-R | En | 1000 | 0.12 | 12 |

Bug: Incorrect Mask Generation After Audio Padding

Description

During code review, we discovered that the mask generation after audio padding is incorrectly implemented: the current code creates a mask of all 1s for padded audio. Given padded audio with padding value 0:

```python
# Input audio: [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]  (0 = padding)

# Current problematic code
concatenated_audio = self.pad_audio(concatenated_audio)
mask = torch.zeros(30 * 16000 // 320, dtype=torch.bool)  # 750 frames for 30s
# Bug: the length is taken AFTER padding, so audio_frames spans the padding too
audio_frames = min(len(concatenated_audio), self.max_audio_length) // 320
mask[:audio_frames] = 1  # includes padding frames
```
  • The training code reserves a special VQ token for padding (vq_codes + 1).
  • However, with mask = 1 everywhere, ~mask becomes all zeros. Instead of the expected mask [1, 1, 1, 1, 1, 0, 0, 0, 0, 0], the code creates the buggy mask [1, 1, 1, 1, 1, 1, 1, 1, 1, 1].
  • Result: the special padding embedding is never applied, because ~mask is all 0s.
```python
# Impact on training: this branch should overwrite padding positions with the
# learned padding embedding, but it never fires because ~mask is all zeros
if self.training and self.config.mask_embs and mask is not None:
    x[~mask] = project_out(self.rq.layers[0]._codebook.embed[0, self.vq_codes])
```

Impact

  • The padding token embedding is never trained.
  • The model sees raw embeddings in the padding regions instead of a consistent padding token, leading to semantic token sequences always coming out at the fixed length of 750 when encoding audio.

Next Steps

  • Fix mask generation when padding audio.
  • Add validation test for mask generation.

This is solved by janhq/WhisperSpeech#19.
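
For reference, a hypothetical sketch of the kind of fix involved (the names mirror the buggy snippet above; the actual change in janhq/WhisperSpeech#19 may differ): capture the audio length before padding and mark only real frames.

```python
# Measure the true length BEFORE padding, then build the mask from it
original_length = len(concatenated_audio)  # samples before padding
concatenated_audio = self.pad_audio(concatenated_audio)

mask = torch.zeros(30 * 16000 // 320, dtype=torch.bool)  # 750 frames for 30s
audio_frames = min(original_length, self.max_audio_length) // 320
mask[:audio_frames] = 1  # 1 = real audio frame, 0 = padding frame
```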

[Add PAD tokens] IchigoWhisper Phase 1 50e, Phase 2 20e

Phase 1

  1. viVoice

| Model Name | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER (%) |
|---|---|---|---|---|---|---|---|
| IchigoWhisper | 1 | 2 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 25.92 |
| IchigoWhisper | 1 | 2 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 23.66 |
| IchigoWhisper | 1 | 39 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 22.37 |
| IchigoWhisper | 1 | 49 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 20.75 |
| PhoWhisper Large 1.55B | - | - | - | viVoice | Vi | 1000 | 23.08 |
| Whisper Medium 0.76B | - | - | - | viVoice | Vi | 1000 | 20.45 |
| PhoWhisper Large 1.55B | - | - | - | viVoice | Vi | 10000 | 24.00 |
| Whisper Medium 0.76B | - | - | - | viVoice | Vi | 10000 | 17.78 |
  2. LibriTTS-R

| Model Name | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER (%) |
|---|---|---|---|---|---|---|---|
| IchigoWhisper | 1 | 2 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 16.82 |
| IchigoWhisper | 1 | 39 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 18.81 |
| IchigoWhisper | 1 | 49 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 17.46 |
| PhoWhisper Large 1.55B | - | - | - | LibriTTS-R | En | 4689 | 47.52 |
| Whisper Medium 0.76B | - | - | - | LibriTTS-R | En | 4689 | 13.06 |

Phase 2 (Ongoing)

  • Resume from the phase 1 49e checkpoint.

  1. viVoice

| Model Name | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER (%) |
|---|---|---|---|---|---|---|---|
| IchigoWhisper | 2 | 5 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 14.46 |
| IchigoWhisper | 2 | 20 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 12.91 |
| PhoWhisper Large 1.55B | - | - | - | viVoice | Vi | 1000 | 23.08 |
| Whisper Medium 0.76B | - | - | - | viVoice | Vi | 1000 | 18.64 |
  2. LibriTTS-R

| Model Name | Phase | Epoch | Train dataset | Test dataset | Test language | Test samples | WER (%) |
|---|---|---|---|---|---|---|---|
| IchigoWhisper | 2 | 5 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 16.78 |
| IchigoWhisper | 2 | 20 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 17.22 |
| PhoWhisper Large 1.55B | - | - | - | LibriTTS-R | En | 1000 | 59.72 |
| Whisper Medium 0.76B | - | - | - | LibriTTS-R | En | 1000 | 12.99 |

[Merge Codebooks] IchigoWhisper Phase 1 50e, Phase 2 5e

How to merge?

[image: codebook merge diagram]

Merged codebook size = 2049 (IchigoWhisper w/ mask token) + 512 (WhisperVQ w/o mask token) = 2561.
The 2048 codes w/ mask token come first, the 512 codes after.

```
# 1. Initial state
Codebook 512:  [512 codes + 1 mask token]
[C1 C2 C3 ... C512 M]

Codebook 2048: [2048 codes + 1 mask token]
[D1 D2 D3 ... D2048 M]

# 2. Remove mask token from 512
Codebook 512 (without mask):
[C1 C2 C3 ... C512]  # 512 codes

Codebook 2048 (keeps mask):
[D1 D2 D3 ... D2048 M]  # 2049 codes

# 3. Create new empty codebook
New size = 512 + 2049 = 2561 codes
[_ _ _ ... _ _ _]  # 2561 empty slots

# 4. Merge process
Step 1: copy the 2048 codes + mask first
[D1 D2 D3 ... D2048 M | _ _ _ ... _ _ _ _ ]
 |----2049 codes----| |-----512 slots-----|

Step 2: copy the 512 codes after
[D1 D2 D3 ... D2048 M | C1 C2 C3 ... C512 ]
 |----2049 codes----| |-----512 codes-----|
```
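
A minimal sketch of the merge in PyTorch (the embedding dim and random tensors are stand-ins; the real codebooks come from the two checkpoints):

```python
import torch

ichigo_codebook = torch.randn(2049, 64)   # 2048 codes + 1 mask token (kept)
whisper_codebook = torch.randn(513, 64)   # 512 codes + 1 mask token (dropped)

merged = torch.cat(
    [
        ichigo_codebook,        # D1..D2048 + M come first
        whisper_codebook[:-1],  # C1..C512, mask token removed
    ],
    dim=0,
)
assert merged.shape[0] == 2561
```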

Experiments

  1. viVoice

| Model Name | Phase | Codebook Size | Base VQ | Train dataset | Test dataset | Test language | Test samples | WER (%) |
|---|---|---|---|---|---|---|---|---|
| IchigoWhisper | 2 | 2561 | en+pl | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 11.39 |
| IchigoWhisper | 2 | 2049 | v3-7lang | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 14.46 |
| IchigoWhisper | 2 | 2561 | v3-7lang | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 11.36 |
| PhoWhisper Large 1.55B | - | - | - | - | viVoice | Vi | 1000 | 23.08 |
| Whisper Medium 0.76B | - | - | - | - | viVoice | Vi | 1000 | 18.64 |
  2. LibriTTS-R

| Model Name | Phase | Codebook Size | Base VQ | Train dataset | Test dataset | Test language | Test samples | WER (%) |
|---|---|---|---|---|---|---|---|---|
| IchigoWhisper | 2 | 2561 | en+pl | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 12.96 |
| IchigoWhisper | 2 | 2049 | v3-7lang | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 16.78 |
| IchigoWhisper | 2 | 2561 | v3-7lang | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 13.01 |
| PhoWhisper Large 1.55B | - | - | - | - | LibriTTS-R | En | 1000 | 59.72 |
| Whisper Medium 0.76B | - | - | - | - | LibriTTS-R | En | 1000 | 12.99 |