training run: tuning ichigo-quantizer
Goal
Tune to find the best parameters for the Ichigo Quantizer
Hypothesis
- The current default codebook size (512) is too small for multilingual training, and training from scratch requires a large amount of data to converge
- Training on a mixed dataset (English-Vietnamese) maintains the performance of the original WhisperVQ
Task:
- Add code to initialize the codebook embedding from a WhisperVQ checkpoint
- Fix the training loss getting stuck
- Add code to preprocess multiple training datasets (filtered LibriTTS-R + Viet-Bud500)
- Add quantizer inference and fill in the Google Sheet
- Pack short audio (~3s) samples into longer ones (up to 30s)
- Verify KL loss
- Add WER for Vietnamese
- Train on a high-quality Vietnamese dataset (viVoice)
- Refine the Bud500 dataset
- Scale training to large datasets (viVoice, MLS, etc.)
Training Result:
The training results are here: #146

Acknowledgment:
Great work by @tuanlda78202 on adding code to use weights from the WhisperVQ 7-language quantizer checkpoint.
Pull Request:
PR janhq/WhisperSpeech#7 on WhisperSpeech
Impact:
This approach could save significant time compared to training from scratch.
Observation:
The dataset mostly contains sequences of <200 tokens (30s audio), leading to high KL loss.
Data distribution: (histogram omitted)
Testing:
- Adjusted `max_tokens` to 20, 50, and 200.
- Results:
  - At `max_tokens = 20`, KL loss dropped to 4.
  - At `max_tokens = 200`, KL loss > 10.
- Loss test: (plots omitted)
Problem:
Excessive padding tokens during training inflated the loss.
Solution:
Implement dataset packing to reduce padding tokens.
Current PR (WIP): janhq/WhisperSpeech#8
Problem
- The default `WhisperVQ` codebook size (512) may be too small to train effectively on multilingual datasets; training from scratch also requires a lot of data and time to converge to the best loss.
- Training loss is high and plateaus, so the model cannot learn further.
- Training only on new-language datasets may degrade the performance of the English-pretrained VQ checkpoint.
- Training on short audio (~3s) with padding is very inefficient.
- The KL loss may drive the prediction outputs to diverge.
Solution
- Initialize weights from the WhisperVQ-7lang checkpoint (trained on multiple languages, with good results).
- Modify the architecture (codebook size, dimension) or the data pipeline, run experiments, and verify the hypotheses.
- Train on mixed-language datasets.
- Concatenate multiple short audio files into one long file (up to 30s).
- Turn off KL loss during training.
Implementation
- Initialize weights from the `WhisperVQ-7lang` checkpoint: the checkpoint codebook size (512) does not match our increased codebook size (1024), so we copy the first 512 codebook entries from the checkpoint and initialize the remaining 512 by experiment: average embedding, Kaiming init, or duplicating the pretrained codes with random noise (see the sketch after the table below).
- Try different KL-loss weighting factors to check the impact on the main model.
Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
---|---|---|---|---|---|---|---|---|---|
init ckpt dim 1024 - kl 5 | 31m 1s | 42 | 782 | 78.53279 | 1.11499 | 15.48318 | 0.0018883 | 858 | 83.78906 |
init ckpt dim 1024 - kl 2 | 1h 57m 1s | 42 | 2924 | 30.53683 | 0.91597 | 14.80941 | 0.0020393 | 875 | 85.44922 |
init ckpt dim 1024 - kl 1.5 | 1h 58m 7s | 42 | 2939 | 23.34618 | 0.84657 | 14.99841 | 0.0019927 | 886 | 86.52344 |
init ckpt dim 1024 - kl 3 | 2h 43m 4s | 42 | 3726 | 45.72793 | 1.0146 | 14.90375 | 0.0020783 | 868 | 84.76563 |
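A minimal sketch of the codebook-expansion strategies above (hypothetical helper; the function name, noise scale, and cycling scheme are illustrative assumptions, not the exact training code):

```python
import torch

def expand_codebook(old_embed: torch.Tensor, new_size: int, strategy: str = "dup_noise") -> torch.Tensor:
    """Expand a pretrained (512, dim) codebook to (new_size, dim)."""
    old_size, dim = old_embed.shape
    new_embed = torch.empty(new_size, dim)
    new_embed[:old_size] = old_embed  # keep the pretrained codes as-is
    extra = new_size - old_size
    if strategy == "avg":
        # every new slot starts from the mean pretrained embedding
        new_embed[old_size:] = old_embed.mean(dim=0, keepdim=True)
    elif strategy == "kaiming":
        torch.nn.init.kaiming_normal_(new_embed[old_size:])
    elif strategy == "dup_noise":
        # duplicate pretrained codes (cycled) and perturb with small noise
        idx = torch.arange(extra) % old_size
        new_embed[old_size:] = old_embed[idx] + 0.01 * torch.randn(extra, dim)
    else:  # plain random init
        new_embed[old_size:].normal_()
    return new_embed
```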
- Add a `mask logit` before the `softmax` in the `KL loss` to check whether it is a factor in the model's loss (see the masked-KL sketch after the table below).
Name | trainer/global_step | codebook/used_codes_step | codebook/utilization_step | loss/ce_loss_st | loss/commit_loss_step | loss/kl_loss_step | loss/total_train_step |
---|---|---|---|---|---|---|---|
diana_lavenderblush | 3335 | 889 | 86.81641 | 0.29703 | 0.0013738 | 0.3107 | 0.60911 |
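A minimal sketch of excluding padding positions from the KL term, which is our reading of the `mask logit` experiment (hypothetical helper and shapes):

```python
import torch
import torch.nn.functional as F

def masked_kl(student_logits, teacher_logits, valid_mask):
    """KL(teacher || student), averaged over non-padding positions only.

    student_logits, teacher_logits: (B, T, V); valid_mask: (B, T) bool,
    True where the position holds real content rather than padding.
    """
    log_p = F.log_softmax(student_logits, dim=-1)
    q = F.softmax(teacher_logits, dim=-1)
    kl = F.kl_div(log_p, q, reduction="none").sum(-1)  # per-position KL, (B, T)
    return (kl * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```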
- Checked the raw dataset, including audio lengths and the distribution of text tokens. We found that in the original WhisperSpeech implementation the max token value was very high (200) while our data was much shorter (max 20 tokens), which led to more padding and kept the KL loss very high. We verified this hypothesis with different `max_token` values and saw that setting `max_token` to the maximum number of tokens in the dataset returned the lowest loss.
Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
---|---|---|---|---|---|---|---|---|---|
init_ckpt dim 1024 - 512 later random - bs42 - max_token=20 | 7h 32m 19s | 42 | 15099 | 2.75478 | 0.82511 | 1.92655 | 0.00311208 | 968 | 94.53125 |
init_ckpt dim 1024 - 512 later random - bs42 - max_token=50 | 7h 43m 51s | 42 | 15099 | 4.71717 | 0.7786 | 3.93544 | 0.00313314 | 991 | 96.77734 |
init_ckpt dim 1024 - 512 later random - bs8 | 1h 30m 42s | 8 | 9999 | 15.83308 | 0.76369 | 15.06781 | 0.00151839 | 888 | 86.71875 |
init_ckpt dim 1024 - 512 later random - bs42 - max_token=200 | 1h 5m 41s | 42 | 1689 | 15.96285 | 0.83534 | 15.12557 | 0.001936 | 983 | 95.99609 |
init_ckpt dim 1024 - 512 later avg | 3h 21m 52s | 42 | 4999 | 15.39613 | 0.80295 | 14.59104 | 0.00213388 | 865 | 84.47266 |
- Tested with the trained quantizer; it returns better results than `whisper medium` (which produces many hallucinations). Results are in this sheet.
Preview table of comparison
Audio ID | Ground Truth | Trained Quantizer Output | Whisper Output |
---|---|---|---|
audio_0_6 | các bác sĩ có thể chăm sóc người bệnh | Các bác sĩ có thể chăm sóc người bệnh | để các bác sĩ có thể chăm sóc người bệnh. |
audio_0_7 | em bây giờ mới là hiện tại của anh ấy | Em bây giờ mới là hiện tại của anh ấy | Em bây giờ mới là hiện tại của anh ấy |
audio_0_8 | thôi anh đừng nói gì nữa tôi chưa đủ khổ | Thôi anh đừng nói gì nữa, chưa đủ khổ | Thôi anh đừng nói gì nữa. Tôi chưa đủ khổ. |
- Experimented with removing `special_tokens` when encoding the text input; the results were very poor.
Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
---|---|---|---|---|---|---|---|---|---|
no special tokens | 4h 27m 40s | 80 | 6522 | 5.34698 | 3.75046 | 1.59528 | 0.0012379 | 867 | 84.58537 |
- To speed up training, we removed the `WebDataset` implementation in WhisperSpeech and used PyTorch's native `DataLoader`. This reduced CPU data-processing time, pushed GPU utilization to 100%, cut training time from 13h to 5h on a single A6000 GPU, and enabled multi-GPU DDP training.
- Trained on mixed (English + Vietnamese) data: LibriTTS-R (27 GB, 112k train samples) for English and Viet-Bud500 (98 GB, 630k samples) for Vietnamese, with weighted sampling between Vietnamese (70%) and English (30%) applied to the per-batch training distribution (see the sampler sketch after the table below).
Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
---|---|---|---|---|---|---|---|---|---|
init ckpt 1024 - 512 random - bs80 - max_token20 - mix_data | 7952 | 80 | 3191 | N/A | 0.5281 | 1.4206 | 0.0020 | 964 | 94.05% |
init ckpt 2048 - 512 dup noise - bs80 - max_token20 - mix_data | 24916 | 80 | 9338 | 1.5538 | 0.4797 | 1.0716 | 0.0025 | 1928 | 94.09% |
init ckpt 1024 - 512 dup noise - bs80 - max_token20 - mix_data | 57964 | 80 | 23329 | 1.6326 | 0.4719 | 1.1567 | 0.0040 | 1006 | 98.17% |
init ckpt 2048 - 512 random - bs80 - max_token20 - mix_data | 58649 | 80 | 23329 | 1.3866 | 0.4406 | 0.9432 | 0.0028 | 1557 | 75.99% |
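A minimal sketch of the 70/30 per-batch weighting with PyTorch's `WeightedRandomSampler` (dummy stand-in datasets; sizes and batch size are illustrative):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Stand-ins for the real Bud500 / LibriTTS-R datasets.
vi_ds = TensorDataset(torch.randn(630, 16))  # Vietnamese
en_ds = TensorDataset(torch.randn(112, 16))  # English
mixed = ConcatDataset([vi_ds, en_ds])

# Per-sample weights so each batch is ~70% Vietnamese / 30% English,
# regardless of the raw dataset sizes.
weights = torch.tensor([0.7 / len(vi_ds)] * len(vi_ds) + [0.3 / len(en_ds)] * len(en_ds))
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=80, sampler=sampler)
```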
- Concatenated multiple short audio clips into long audio (up to 30s) and set `max_token=200`. This reduced training time ~10x (8 minutes/epoch) on the mixed datasets: concatenating to 30s groups roughly 10 clips into one sample, so the number of training samples shrinks ~10x (a packing sketch follows the table below).
Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
---|---|---|---|---|---|---|---|---|---|
scratch 2048 - largev3 - bs24 - max_token200 - mix_data_73 - 100e - ddp8 | 68056 | 24 | 32263 | 6.444209575653076 | 0.4545549750328064 | 5.978719234466553 | 0.010935629718005655 | 1266.125 | 61.792335510253906 |
init ckpt 2048 - 512 dup_noise - bs42 - max_token200 - mix_data_73 - wo_w_loss - 100e - ddp8 | 50272 | 42 | 24899 | 12.359258651733398 | 0.4022634625434875 | 11.945953369140623 | 0.011041682213544846 | 1974.25 | 96.35187530517578 |
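A minimal sketch of greedy clip packing (hypothetical helper operating on 1-D waveform tensors; the real pipeline presumably also tracks transcript boundaries):

```python
import torch

def pack_audios(clips, sample_rate=16000, max_seconds=30):
    """Greedily pack short waveforms into sequences of up to max_seconds."""
    max_len = max_seconds * sample_rate
    packed, current, current_len = [], [], 0
    for clip in clips:  # each clip is a 1-D waveform tensor
        if current and current_len + len(clip) > max_len:
            packed.append(torch.cat(current))
            current, current_len = [], 0
        current.append(clip)
        current_len += len(clip)
    if current:
        packed.append(torch.cat(current))
    return packed
```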
- Changed Whisper from `medium` to `large-v3`. We found many hallucinated responses, and the mixed dataset led to an English bias during training, resulting in `predictions_loss=5.47` for `large-v3`.
Preview table of comparison
Audio ID | Ground Truth | Trained Quantizer Output | Whisper Output |
---|---|---|---|
audio_0_13 | nơi đây và em thích con người ở đây em | one đây và em thích con người ở đây em | Hãy subscribe cho kênh Ghiền Mì Gõ Để không bỏ lỡ những video hấp dẫn |
audio_0_15 | cước được tắm suối mịn màng trắng sáng | ure được tắm suối miịn màng sắng sáng | Hãy subscribe cho kênh Ghiền Mì Gõ Để không bỏ lỡ những video hấp dẫn |
audio_1_32 | dũng mới cứu được tôi thôi chứ không còn | .ng can cứu được tôi thôi chứ không còn stubborn stubborn stubborn stubborn | Hãy subscribe cho kênh Ghiền Mì Gõ Để không bỏ lỡ những video hấp dẫn |
- Turned off KL loss and resumed weights from phase 1 (trained for 100 epochs). Training result: very good output predictions (better than Whisper `medium`); inference updated at epoch 21 here.
Name | Runtime | batch_size | trainer/global_step | loss/total_train_step | loss/ce | loss/kl | loss/commit_loss | codebook/used_codes | codebook/utilization |
---|---|---|---|---|---|---|---|---|---|
resume concat 100e 2048 - 512 dup_noise - bs42 - max_token200 - mix_data_73 concat true - remove_kl - 100e - ddp8 | 50834 | 42 | 24899 | 0.03272935375571251 | 0.0244855098426342 | 0 | 0.008243842981755733 | 1952.5 | 95.29039001464844 |
Preview table of comparison
Audio ID | Ground Truth | Trained Quantizer Output | Whisper medium Output |
---|---|---|---|
audio_1149 | cảm giác ấm áp anh dành cho tôi anh vào | cảm giác ấm áp anh dành cho tôi anh vào | cảm giác ấm áp anh dành cho tôi |
audio_1150 | rằng những gì đang xảy ra điều trị làm | bằng những gì đang xảy ra điều trị làm | những gì đang xảy ra đều chỉ làm cho tình thường |
audio_1151 | đảo côn lôn thành tỉnh côn sơn trong | làào quôn lôn thành tìnhnh côn sơn trong | là cuốn luôn thành tỉnh cuốn sơn |
audio_1152 | có gì nhiều chỉ có hai mảnh đất mảnh | có gì nhiều chỉ có hai mảnh đất mảnh | empty |
audio_1153 | chỉ đến khi người con thứ của ông vua | chỉ đến khi người con thứ của ông vua | Chỉ đến khi người con thứ của ông vừa vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô cùng vô |
audio_1154 | ngập tràn nhưng cô vẫn giả vờ làm bộ mặt | ngập tràn nhưng cô vẫn giả vờ làm bộ mặt | nhưng cô vẫn giả vờ làm bộ mặt |
After all experiments, we concluded that the best results come from initializing the codebook from the checkpoint (duplicating the 512 pretrained codes with noise for the remaining entries), concatenating multiple short audio clips up to 30s, training on mixed datasets, and turning off KL loss in the second phase:
Phase 1: KL (distillation) loss + CE loss
Phase 2: CE loss
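A minimal sketch of this two-phase objective (hypothetical helper; keeping the commitment loss in both phases is an assumption based on the logged `loss/commit_loss`):

```python
def total_loss(ce_loss, kl_loss, commit_loss, phase: int, kl_weight: float = 1.0):
    # Phase 1 distills from the Whisper teacher (KL) while learning CE;
    # phase 2 drops the KL term and fine-tunes on CE alone.
    loss = ce_loss + commit_loss  # assumption: commitment loss kept in both phases
    if phase == 1:
        loss = loss + kl_weight * kl_loss
    return loss
```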
Problem
- WER results on both the Vietnamese and English datasets are suspiciously low; the evaluation pipeline needs rechecking.
- Testing only on normalized Vietnamese text (lowercase, no punctuation, etc.) in Bud500 is not representative of real-world applications.
Solution
- Check whether the inference pipeline is autoregressive.
- Test on the viVoice dataset, which has natural transcripts (capitalized text, punctuation, etc.).
Implementation
- Fixed an inference bug: removed `input_toks` from the forward pass and applied autoregressive decoding via `whisper.DecodingOptions()`, as sketched below.
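For reference, a minimal sketch of autoregressive decoding with the standard openai-whisper API (model name, audio path, and language are illustrative; the real pipeline decodes from quantized embeddings rather than a stock Whisper model):

```python
import whisper

model = whisper.load_model("medium")
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))  # illustrative path
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Autoregressive decoding: no ground-truth input_toks are fed to the decoder.
options = whisper.DecodingOptions(language="vi", without_timestamps=True)
result = whisper.decode(model, mel, options)
print(result.text)
```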
WER comparison
Dataset | Language | Trained Quantizer | Pho Whisper | Whisper Medium |
---|---|---|---|---|
Bud500 | Vi | 0.22 | 0.08 | 0.62 |
LibriTTS-R | En | 1.90 | 0.46 | 0.12 |
- Tested checkpoints from phase 1 (100 epochs, with KL loss) and phase 2 (21 epochs, without KL loss).
Preview table of comparison (ckpt phase 2)
Audio ID | Ground Truth | Trained Quantizer Output | PhoWhisper Large Output | Whisper medium Output |
---|---|---|---|---|
audio_10 | Đại tá Trần Đình Hưng, Phó Chỉ huy trưởng, Tham mưu trưởng, Bộ Chỉ huy quân sự tỉnh | đại tá trần định hưng phó chỉ huy trưởng tham mưu trưởng bầu chỉ quy quân sự tỉnh | đại tá trần đình hưng phó chỉ huy trưởng tham mưu trưởng bộ chỉ huy quân sự tỉnh | Đại tá Trần Đình Hưng, phó chỉ quy trưởng, tham mưu trưởng, Bộ Chủ quy quân sự tỉnh |
Preview table of comparison (ckpt phase 1, better)
Audio ID | Ground Truth | Trained Quantizer Output | PhoWhisper Large Output | Whisper medium Output |
---|---|---|---|---|
audio_10 | Đại tá Trần Đình Hưng, Phó Chỉ huy trưởng, Tham mưu trưởng, Bộ Chỉ huy quân sự tỉnh | Đại tá Trần Đình Hưng, Phó Chỉ quy trưởng, Tham mưu trưởng, Bộ Chính quyên quân sự tỉnh | đại tá trần đình hưng phó chỉ huy trưởng tham mưu trưởng bộ chỉ huy quân sự tỉnh | Đại tá Trần Đình Hưng, phó chỉ quy trưởng, tham mưu trưởng, Bộ Chủ quy quân sự tỉnh |
Errors in data sampling
- We should treat clean, finished data like LibriTTS-R separately; there is no need for excess concatenation as below.
- Concatenation should only happen for low-resource languages.
- Reduce the number of training epochs.
- Process the data before running the experiment; AVOID BUILDING DYNAMIC DATASETS ON THE FLY AT ALL COSTS.
Next test run:
- 2 phases:
- Phase 1: KL Loss + CE Loss
- Phase 2: CE Loss
- 10 epochs for each phase
What to validate
- Is the English degradation due to the wrong data sampling (the error above)?
- Are we overtraining with the current 100 epochs?
Only after validating the above points can we move forward with the next steps.
cc @tuanlda78202 TBD today
Problem
- Are we overtraining with the current 100 epochs for both training phases?
- Is the English degradation due to concatenating audio and the sampling distribution being skewed to the Vietnamese language?
Solution
- Reduce the training epochs for both training phases.
- Train on the original (non-concatenated) English dataset and balance the sampling distribution on a per-batch basis to regularize the model.
Implementation
- Train two phases of 10 epochs each (phase 1 with KL loss, phase 2 without, per the table below), weighting the dataset distribution to `0.5` between the concatenated Bud500 and the non-concatenated LibriTTS-R.
Name | Epoch | KL Loss | Val Loss | Val Acc |
---|---|---|---|---|
Phase 1 | 10 | On | 15.56 | 0.84 |
Phase 2 | 10 | Off | 0.21 | 0.94 |
- Inference checkpoints are saved after 10 epochs for both phases and evaluated on viVoice (100 samples) and LibriTTS-R:
  - viVoice, first 100 samples (Vietnamese): (results omitted)
  - LibriTTS-R (English): (results omitted)
Conclusion
- High-quality datasets have an important impact on quantizer performance.
- Overtraining phase 1 with KL loss makes the model generalize better.
- Continuing with a few epochs of phase 2 training (without KL loss) helps the model avoid overfitting.
Training model on high-quality datasets
Problem
- Training the model on low-quality datasets (Bud500) leads to poor performance.
Solution
- Train the model on high-quality datasets (viVoice).
Results
Phase 1 (with KL loss)
Trained on viVoice (868k samples in jan-hq) and LibriTTS-R (112k samples), without the 30s-concatenation dataset, and with early stopping if accuracy does not improve for 10 epochs.
```python
# Implementation of the accuracy metric used for early stopping
def _update_validation_metrics(self, logits, output_toks):
    # count correct predictions over non-ignored (-100) positions only
    valid_toks = output_toks != -100
    self.val_true += (
        (logits.detach().argmax(-1)[valid_toks] == output_toks[valid_toks])
        .float()
        .sum()
    )
    self.val_total += valid_toks.float().sum()

def get_metrics(self):
    metrics = {
        "acc_0": (self.val_true / self.val_total).item(),
    }
    self.val_true[:] = 0
    self.val_total[:] = 0
    return metrics
```
Result on the `val` dataset
Exp ID | Number of samples | Best epoch | Training time | Accuracy | Loss |
---|---|---|---|---|---|
p1-vivoice+librittsr | 10000 | 29 | 2d 12h 29m 37s | 0.89 | 14.59 |
Visualization of loss & accuracy on the `val` phase: (plots omitted)
Testing
Summary results
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER |
---|---|---|---|---|---|---|---|
Ichigo Quantizer | 1 | 10 | Bud500 + LibriTTS-R | LibriTTS-R | En | 4689 | 0.56 |
Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 0.13 |
Ichigo Quantizer | 2 | 100 | Bud500 + LibriTTS-R | LibriTTS-R | En | 4689 | 1.90 |
Ichigo Quantizer | 2 | 10 | Bud500 + LibriTTS-R | LibriTTS-R | En | 4689 | 0.22 |
PhoWhisper Large | - | - | - | LibriTTS-R | En | 4689 | 0.47 |
Whisper Medium | - | - | - | LibriTTS-R | En | 4689 | 0.12 |
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER |
---|---|---|---|---|---|---|---|
Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 0.21 |
PhoWhisper Large | - | - | - | viVoice | Vi | 10000 | 0.23 |
Whisper Medium | - | - | - | viVoice | Vi | 10000 | 0.18 |
- LibriTTS-R: (plots omitted)
- viVoice: (plots omitted)
Phase 2 (without KL loss)
Phase 1 (29e) + Phase 2 (10e)
- viVoice
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER | WER (%) |
---|---|---|---|---|---|---|---|---|
Ichigo Quantizer | 2 | 2 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 0.20 | 20 |
PhoWhisper Large | - | - | - | viVoice | Vi | 10000 | 0.24 | 24 |
Whisper Medium | - | - | - | viVoice | Vi | 10000 | 0.18 | 18 |
Ichigo Quantizer | 2 | 9 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 0.11 | 11 |
PhoWhisper Large | - | - | - | viVoice | Vi | 1000 | 0.23 | 23 |
Whisper Medium | - | - | - | viVoice | Vi | 1000 | 0.18 | 18 |
- LibriTTS-R
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER | WER (%) |
---|---|---|---|---|---|---|---|---|
Ichigo Quantizer | 2 | 2 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 1.23 | 123 |
PhoWhisper Large | - | - | - | LibriTTS-R | En | 4689 | 0.47 | 47 |
Whisper Medium | - | - | - | LibriTTS-R | En | 4689 | 0.12 | 12 |
Ichigo Quantizer | 2 | 9 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 0.70 | 70 |
PhoWhisper Large | - | - | - | LibriTTS-R | En | 1000 | 0.59 | 59 |
Whisper Medium | - | - | - | LibriTTS-R | En | 1000 | 0.12 | 12 |
- Added a prompt to `whisper.DecodingOptions` when testing:

```python
prompt = f"You are a professional transcriber, fluent in {prefix_lang}. You are listening to a recording in which a person is potentially speaking {prefix_lang}, and no other languages. They may have a strong accent. You are to transcribe utterances of {prefix_lang} accordingly"
```
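A hedged usage sketch: `whisper.DecodingOptions` accepts a `prompt` field that conditions the decoder on it as preceding context (`prefix_lang`, model name, and audio path are illustrative):

```python
import whisper

prefix_lang = "Vietnamese"  # illustrative
prompt = f"You are a professional transcriber, fluent in {prefix_lang}. ..."  # full text above

model = whisper.load_model("medium")
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
options = whisper.DecodingOptions(prompt=prompt, language="vi", without_timestamps=True)
print(whisper.decode(model, mel, options).text)
```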
Model Name | Prompt | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER | WER (%) |
---|---|---|---|---|---|---|---|---|---|
Ichigo Quantizer | No | 2 | 9 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 0.93 | 93 |
PhoWhisper Large | No | - | - | - | LibriTTS-R | En | 1000 | 0.59 | 59 |
Whisper Medium | No | - | - | - | LibriTTS-R | En | 1000 | 0.12 | 12 |
Ichigo Quantizer | Yes | 2 | 9 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 0.13 | 13 |
PhoWhisper Large | No | - | - | - | viVoice | Vi | 1000 | 0.23 | 23 |
Whisper Medium | Yes | - | - | - | viVoice | Vi | 1000 | 0.17 | 17 |
Phase 1 full epoch (100e), Phase 2 10e [Ongoing]
Phase 1
- viVoice
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER | WER (%) |
---|---|---|---|---|---|---|---|---|
Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 0.21 | 21 |
PhoWhisper Large | - | - | - | viVoice | Vi | 10000 | 0.23 | 23 |
Whisper Medium | - | - | - | viVoice | Vi | 10000 | 0.18 | 18 |
Ichigo Quantizer | 1 | 62 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 0.18 | 18 |
PhoWhisper Large | - | - | - | viVoice | Vi | 1000 | 0.23 | 23 |
Whisper Medium | - | - | - | viVoice | Vi | 1000 | 0.18 | 18 |
- LibriTTS-R
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER | WER (%) |
---|---|---|---|---|---|---|---|---|
Ichigo Quantizer | 1 | 29 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 0.13 | 13 |
PhoWhisper Large | - | - | - | LibriTTS-R | En | 4689 | 0.47 | 47 |
Whisper Medium | - | - | - | LibriTTS-R | En | 4689 | 0.12 | 12 |
Ichigo Quantizer | 1 | 62 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 0.13 | 13 |
PhoWhisper Large | - | - | - | LibriTTS-R | En | 1000 | 0.59 | 59 |
Whisper Medium | - | - | - | LibriTTS-R | En | 1000 | 0.12 | 12 |
Bug: Incorrect Mask Generation After Audio Padding
Description
During code review, we discovered that the mask generation after audio padding is incorrectly implemented. The current code creates masks with all 1s for padded audio. Given the padded audio with the padding value = 0:
Input Audio: [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]

```python
# Current problematic code
concatenated_audio = self.pad_audio(concatenated_audio)
mask = torch.zeros(30 * 16000 // 320, dtype=torch.bool)
audio_frames = min(len(concatenated_audio), self.max_audio_length) // 320
mask[:audio_frames] = 1  # Bug: this includes padding frames
```
- The training code reserves a special VQ token for padding (`vq_codes + 1`).
- However, with `mask = 1` everywhere, `~mask` becomes all zeros. Instead of the expected mask `[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]`, @tuanlda78202's code created the buggy mask `[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]`.
- Result: the special padding embedding is never applied, because `~mask` is all zeros.
```python
# Impact on training: this branch never fires for padding frames,
# because ~mask is all zeros.
if self.training and self.config.mask_embs and mask is not None:
    x[~mask] = project_out(self.rq.layers[0]._codebook.embed[0, self.vq_codes])
```
Impact
- The padding token is never trained.
- The model sees raw embeddings in padding regions instead of a consistent padding token, leading to semantic token sequences stuck at the fixed length of 750 when encoding audio.
Next Steps
- Fix mask generation when padding audio.
- Add validation test for mask generation.
This is solved by PR janhq/WhisperSpeech#19.
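For reference, a minimal sketch of the corrected mask generation (capturing the unpadded length before padding is our reading of the fix; the authoritative change is in the PR above):

```python
# Capture the real length *before* padding, so only genuine audio frames
# are marked valid and padded frames keep mask = 0.
unpadded_len = len(concatenated_audio)
concatenated_audio = self.pad_audio(concatenated_audio)
mask = torch.zeros(30 * 16000 // 320, dtype=torch.bool)
audio_frames = min(unpadded_len, self.max_audio_length) // 320
mask[:audio_frames] = 1
```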
[Add PAD tokens] IchigoWhisper Phase 1 50e, Phase 2 20e
Phase 1
- viVoice
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER (%) |
---|---|---|---|---|---|---|---|
IchigoWhisper | 1 | 2 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 25.92 |
IchigoWhisper | 1 | 2 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 23.66 |
IchigoWhisper | 1 | 39 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 22.37 |
IchigoWhisper | 1 | 49 | viVoice + LibriTTS-R | viVoice | Vi | 10000 | 20.75 |
PhoWhisper Large 1.55B | - | - | - | viVoice | Vi | 1000 | 23.08 |
Whisper Medium 0.76B | - | - | - | viVoice | Vi | 1000 | 20.45 |
PhoWhisper Large 1.55B | - | - | - | viVoice | Vi | 10000 | 24.00 |
Whisper Medium 0.76B | - | - | - | viVoice | Vi | 10000 | 17.78 |
- LibriTTS-R
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER (%) |
---|---|---|---|---|---|---|---|
IchigoWhisper | 1 | 2 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 16.82 |
IchigoWhisper | 1 | 39 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 18.81 |
IchigoWhisper | 1 | 49 | viVoice + LibriTTS-R | LibriTTS-R | En | 4689 | 17.46 |
PhoWhisper Large 1.55B | - | - | - | LibriTTS-R | En | 4689 | 47.52 |
Whisper Medium 0.76B | - | - | - | LibriTTS-R | En | 4689 | 13.06 |
Phase 2 (Ongoing)
- Resume from phase 1 49e checkpoint
- viVoice
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER (%) |
---|---|---|---|---|---|---|---|
IchigoWhisper | 2 | 5 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 14.46 |
IchigoWhisper | 2 | 20 | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 12.91 |
PhoWhisper Large 1.55B | - | - | - | viVoice | Vi | 1000 | 23.08 |
Whisper Medium 0.76B | - | - | - | viVoice | Vi | 1000 | 18.64 |
- LibriTTS-R
Model Name | Phase | Epoch | Dataset train | Dataset test | Language Test | Test samples | WER (%) |
---|---|---|---|---|---|---|---|
IchigoWhisper | 2 | 5 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 16.78 |
IchigoWhisper | 2 | 20 | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 17.22 |
PhoWhisper Large 1.55B | - | - | - | LibriTTS-R | En | 1000 | 59.72 |
Whisper Medium 0.76B | - | - | - | LibriTTS-R | En | 1000 | 12.99 |
[Merge Codebooks] IchigoWhisper Phase 1 50e, Phase 2 5e
How to merge?
Codebook Size = 2049 (IchigoWhisper w/ mask token) + 512 (WhisperVQ w/o mask token)
2048 w/ mask first, 512 later
```
# 1. Initial state
Codebook 512:  [512 codes + 1 mask token]
  [C1 C2 C3 ... C512 M]
Codebook 2048: [2048 codes + 1 mask token]
  [D1 D2 D3 ... D2048 M]

# 2. Remove mask token from the 512 codebook
Codebook 512 (without mask):
  [C1 C2 C3 ... C512]        # 512 codes
Codebook 2048 (keeps mask):
  [D1 D2 D3 ... D2048 M]     # 2049 codes

# 3. Create new empty codebook
New size = 512 + 2049 = 2561 codes
  [_ _ _ ... _ _ _]          # 2561 empty slots

# 4. Merge process
Step 1: copy the 2048+mask codes first
  [D1 D2 D3 ... D2048 M | _ _ _ ... _ _ _]
  |-----2049 codes------|-----512 slots----|
Step 2: copy the 512 codes after
  [D1 D2 D3 ... D2048 M | C1 C2 C3 ... C512]
  |-----2049 codes------|-----512 codes----|
```
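A minimal sketch of the merge described above (hypothetical helper; assumes each codebook stores its mask token last):

```python
import torch

def merge_codebooks(embed_2048_with_mask: torch.Tensor, embed_512_with_mask: torch.Tensor) -> torch.Tensor:
    """Merge a (2049, dim) codebook (2048 codes + mask) with a (513, dim) one.

    Drops the mask token from the 512 codebook and appends its codes after
    the 2049 entries, yielding a (2561, dim) codebook.
    """
    codes_512 = embed_512_with_mask[:512]  # drop the trailing mask token
    return torch.cat([embed_2048_with_mask, codes_512], dim=0)
```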
Experiments
- viVoice
Model Name | Phase | Codebook Size | Base VQ | Dataset train | Dataset test | Language Test | Test samples | WER (%) |
---|---|---|---|---|---|---|---|---|
IchigoWhisper | 2 | 2561 | en+pl | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 11.39 |
IchigoWhisper | 2 | 2049 | v3-7lang | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 14.46 |
IchigoWhisper | 2 | 2561 | v3-7lang | viVoice + LibriTTS-R | viVoice | Vi | 1000 | 11.36 |
PhoWhisper Large 1.55B | - | - | - | - | viVoice | Vi | 1000 | 23.08 |
Whisper Medium 0.76B | - | - | - | - | viVoice | Vi | 1000 | 18.64 |
- LibriTTS-R
Model Name | Phase | Codebook Size | Base VQ | Dataset train | Dataset test | Language Test | Test samples | WER (%) |
---|---|---|---|---|---|---|---|---|
IchigoWhisper | 2 | 2561 | en+pl | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 12.96 |
IchigoWhisper | 2 | 2049 | v3-7lang | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 16.78 |
IchigoWhisper | 2 | 2561 | v3-7lang | viVoice + LibriTTS-R | LibriTTS-R | En | 1000 | 13.01 |
PhoWhisper Large 1.55B | - | - | - | - | LibriTTS-R | En | 1000 | 59.72 |
Whisper Medium 0.76B | - | - | - | - | LibriTTS-R | En | 1000 | 12.99 |