task: Train and test text2semantic under decoder only framework for ichigo v0.5
tikikun opened this issue · 19 comments
Motivation
Ichigo v0.5 will support additional languages, which will make the traditional t2s approach obsolete. This is a good chance to introduce a t2s framework that we have full control over.
Goal
Be able to handle any arbitrary language
- Current Ichigo approach (WhisperVQ) is only trained for 7 languages
- We cannot find an ASR or STT module for every new language
- Our approach: Text to Semantic (same semantic space as Ichigo speech embeddings)
Methodology
- Teacher: Speech -> semantic (Whisper encoder)
- Student: Text -> semantic (decoder, Qwen base model)
First step: run the base case with English (before exploring other languages)
- WhisperSpeech text-to-semantic model failed for our synthetic data pipeline, producing incompatible sound tokens that disrupted Ichigo’s comprehension.
- Proposed solution: a custom decoder-only text-to-semantic model (<3B parameters, similar to Qwen 2.5) with knowledge transfer from WhisperVQ and a more efficient architecture.
- Processed 10k English samples from MLS Eng 10k dataset (2.42M samples) using WhisperVQ for semantic token extraction, adding a <|text_to_semantic|> task token in user turns.
- Example dataset: Instruction data.
- Modified Qwen 2.5 0.5B model:
  - Introduced the <|text_to_semantic|> task token and added 512 sound tokens + 3 special tokens (start, end, mask) to its vocabulary.
  - Trained with instruction-based samples for text-to-semantic conversion.
  - Embedded control tokens without modifying the tokenizer; scaled the embedding layer and LM head to [152,192](https://github.com/QwenLM/Qwen/issues/419) tokens for training speed optimization.
What needs to be done:
- Train t2s on a decoder model to test feasibility (we already have the data) under Ichigo v0.4
- Train t2s on a decoder model with the new quantizer after #146 is done
Experiments
Run ID | Date | Model Config | Dataset | Learning Rate | Batch Size | Steps | Loss | Hardware |
---|---|---|---|---|---|---|---|---|
exp-t2s-0.5B | 2024-11-28 | Full-Finetune | Instruction text to sound semantic token | 1e-3 | 96 | 28810 | 1.6-1.7 | ~4 hours on 2xH100 |
exp-t2s-1.5B-1 | 2024-11-29 | Full-Finetune | Instruction text to sound semantic token | 1e-3 | 84 | 28810 | 2.64 | ~10 hours on 6xA6000 |
exp-t2s-1.5B-2 | 2024-11-30 | Full-Finetune | Instruction text to sound semantic token | 1e-4 | 84 | 28810 | 1.84 | ~10 hours on 6xA6000 |
exp-t2s-llama3.2-1B | 2024-12-01 | Full-Finetune | Instruction text to sound semantic token | 1e-4 | 96 | 25208 | 1.73 | ~9 hours on 6xA6000 |
exp-t2s-llama3.2-1B-1 | 2024-12-02 | Full-Finetune | Instruction text to sound semantic token | 1.5e-4 | 192 | 12604 | 1.77 | ~6 hours on 6xA6000 |
exp-t2s-llama3.2-1B-2 | 2024-12-03 | Full-Finetune | Instruction text to sound semantic token | 1.5e-4 | 192 | 57930 | 1.44 | ~30 hours on 6xA6000 |
exp-t2s-llama3.2-1B-dedup | 2024-12-05 | Full-Finetune | Instruction deduplicated sound semantic token | 1.5e-4 | 168 | 33000 | 1.73 | ~30 hours on 6xA6000 |
exp-t2s-llama3.2-1B-compress | 2024-12-09 | Full-Finetune | Instruction compressed sound semantic token | 1.5e-4 | 168 | 44713 | 1.6-1.63 | ~33 hours on 6xA6000 |
exp-t2s-llama3.2-1B-compress-1 | 2024-12-10 | Full-Finetune | Instruction compressed sound semantic token | 3e-4 | 256 | 57930 | 1.54 | ~24 hours on 6xA6000 |
exp-t2s-llama3.2-3B-compress | 2024-12-13 | Full-Finetune | Instruction compressed sound semantic token | 1.5e-4 | 192 | 40607 | 1.59 | ~60 hours on 6xA6000 |
exp-t2s-llama3.2-3B | 2024-12-13 | Full-Finetune | Instruction text to sound semantic token | 3e-4 | 192 | 77241 | 1.36 | ~60 hours on 6xA6000 |
Run ID | Date | Model Config | Dataset | Learning Rate | Batch Size | Steps | Loss | Hardware |
---|---|---|---|---|---|---|---|---|
Ichigo-t2s-1B-vie | 2024-12-24 | Full-Finetune | Vivoice instruction compressed sound semantic token | 1e-4 | 192 | 23115 | 1.42 | ~7.5 hours on 6xA6000 |
Ichigo-t2s-1B-vie+en-1 | 2024-12-25 | Full-Finetune | Vivoice+MLS_Eng_10k instruction compressed sound semantic token | 1e-4 | 288 | Early stop at epoch 3 | 1.98 | ~22.5 hours on 6xA6000 |
Ichigo-t2s-1B-vie+en-2 | 2024-12-25 | Full-Finetune | Vivoice+MLS_Eng_10k instruction compressed sound semantic token | 1e-4 | 288 | Early stop at epoch 3 | On-going | ~22.5 hours on 6xA6000 |
Ichigo-t2s-1B-vie+en-3 | 2024-12-25 | Full-Finetune | Vivoice+Libris_r_flitered_112k instruction compressed sound semantic token | 1e-4 | 288 | On-going | On-going | ~10 hours on 6xA6000 |
Test Results:
- Sample synthetic data generation (text to semantics) using different repetition settings
Experiment ID | Inference Result | Note |
---|---|---|
exp-t2s-llama3.2-1B-1 | exp-t2s-llama3.2-1B-1-result | The model does not know when to generate the <sound_end> token and has repetition and hallucination problems. |
exp-t2s-llama3.2-1B-2 | exp-t2s-llama3.2-1B-2 | Successfully trained a text-to-semantic model: it can generate sound tokens that Ichigo v0.4 can understand, but performance relies heavily on the repetition_penalty hyperparameter. If this is set to 1.00, the model has hallucination and repetition problems and Ichigo can't understand the output. |
exp-t2s-llama3.2-1B-compress | exp-t2s-llama3.2-1B-compress | After 2 epochs, the training loss converged to 1.6, so we early-stopped the training. Despite the relatively high convergence loss, the model successfully demonstrated the ability to generate sound tokens when evaluated on the test set. |
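For context, repetition_penalty here is an inference-time decoding setting. A minimal generation sketch, assuming a Hugging Face causal-LM checkpoint; the checkpoint path and decoding values are placeholders, not the exact settings used in these runs:

```python
# Sketch of T2S inference showing where repetition_penalty enters the pipeline.
# The checkpoint path and decoding values are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/exp-t2s-llama3.2-1B-2"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16)

prompt = "<|text_to_semantic|>hello world."
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
    repetition_penalty=1.1,  # values near 1.00 caused repetition/hallucination above
    eos_token_id=tokenizer.convert_tokens_to_ids("<|sound_end|>"),
)
# Decode only the newly generated sound tokens.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```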
Benchmarking
- Using WhisperVQ to dequantize the tokens back to embeddings, then using the Whisper model to decode these embeddings into text. Benchmarked on the LibriSpeech clean test set:

  Model | WER (%) |
  ---|---|
  Synthetic-T2S-1B-compress-with-prompt | 7.37 |
  WhisperVQ 7 lang with prompt | 5.33 |
  Synthetic-T2S-1B-compress | 11.52 |
  WhisperVQ 7 lang | 10.86 |
  Ichigo-t2s-1B-vie+en-1-epoch2 | 23.43 |
  Ichigo-t2s-1B-vie+en-1-epoch3 | 23.83 |
  Ichigo-t2s-1B-vie+en-2-epoch2 | 46.83 |
  Ichigo-t2s-1B-vie+en-2-epoch3 | 29.68 |
  Ichigo-t2s-1B-vie+en-3-epoch2 | 3.99 (after debugging) |
  Ichigo-t2s-1B-vie+en-3-epoch3 | 7.88 |
  Ichigo Quantizer epoch 5 of phase 2 | 11.16 |

- Using WhisperVQ to dequantize the tokens back to embeddings, then using the Whisper model to decode these embeddings into text. Benchmarked on the Bud500 test set:

  Model | WER (%) |
  ---|---|
  Ichigo-t2s-1B-vie | 6.09 |
  Ichigo-t2s-1B-vie+en-1-epoch2 | 4.5 |
  Ichigo-t2s-1B-vie+en-1-epoch3 | 6.31 |
  Ichigo-t2s-1B-vie+en-2-epoch2 | 3.7 |
  Ichigo-t2s-1B-vie+en-2-epoch3 | 3.47 |
  Ichigo-t2s-1B-vie+en-3-epoch2 | 3.27 (after debugging) |
  Ichigo-t2s-1B-vie+en-3-epoch2 | 3.52 |

- Using AudioBench.
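For reference, once the decoded transcripts are available, the WER numbers above can be computed with the standard jiwer package. A minimal sketch, assuming reference transcripts and decoded hypotheses are already collected (the actual benchmark scripts may normalize text differently):

```python
# Minimal WER scoring sketch using jiwer; the normalization here is an assumption
# and may differ from the scripts used to produce the tables above.
import string
import jiwer

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so formatting differences do not count as errors."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level WER between ground-truth transcripts and decoded text."""
    return jiwer.wer([normalize(r) for r in references],
                     [normalize(h) for h in hypotheses])

if __name__ == "__main__":
    print(corpus_wer(["hello world"], ["hello word"]))  # 0.5
```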
Goal
Be able to handle any arbitrary language
- Current Ichigo approach (WhisperVQ) is only trained for 7 languages
- We cannot find an ASR or STT module for every new language
- Our approach: Text to Semantic (same semantic space as Ichigo speech embeddings)
Methodology
- After extensive hyperparameter tuning, the WhisperSpeech text-to-semantic model proved inadequate for our synthetic data pipeline. The model's output resulted in incompatible sound tokens that broke Ichigo's comprehension capabilities.
To address this, we propose developing a custom text-to-semantic model based on a decoder-only architecture (similar to Qwen 2.5) with <3B parameters. This model will leverage knowledge transfer from WhisperVQ while maintaining a more efficient architecture that better aligns with our use case.
- To develop our custom text-to-semantic model, I processed 10k English samples from the MLS Eng 10k (2.42M samples) dataset by tokenizing the raw speech with WhisperVQ to extract semantic tokens. I also added a special task token <|text_to_semantic|> to the user turn. Here is a sample from the instruction data:
[ { "content": "<|text_to_sementic|>he telegraphed to general pemberton that he had learned sherman was between them with four divisions at clinton saying that it was important to reestablish communications that pemberton might be reenforced and directing him to come up in sherman's rear at once", "role": "user" }, { "content": "<|sound_start|><|sound_0209|><|sound_0134|><|sound_0134|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0329|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0487|><|sound_0459|><|sound_0459|><|sound_0405|><|sound_end|>", "role": "assistant" } ]
- I then modified Qwen 2.5 0.5B by introducing a new task token <|text_to_semantic|> and incorporating 515 new tokens (512 sound tokens + 3 special tokens) into its vocabulary. The training data was structured as instruction-based samples designed to teach the model text-to-semantic token conversion. Following Qwen's methodology, we integrated the control tokens into the embedding layer without modifying the tokenizer, which, according to the Qwen authors, helps optimize training performance; we padded the embedding layer and LM head dimensions to a multiple of 128, resulting in a final embedding size of 152,192 tokens.
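A minimal sketch of this vocabulary-extension step, assuming Hugging Face transformers (the exact training code may differ; unlike the Qwen approach described above, this sketch registers the new tokens through the tokenizer, and the mask-token name is an assumption):

```python
# Sketch of extending Qwen 2.5 0.5B with the task token, 512 sound tokens and
# 3 control tokens, then padding the embedding/LM head size to a multiple of 128.
# Assumes Hugging Face transformers; the real training code may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

new_tokens = ["<|text_to_semantic|>"]
new_tokens += [f"<|sound_{i:04d}|>" for i in range(512)]
new_tokens += ["<|sound_start|>", "<|sound_end|>", "<|sound_mask|>"]  # mask name assumed
tokenizer.add_tokens(new_tokens, special_tokens=True)

# Resizes both the input embeddings and the LM head, padded for training speed;
# with Qwen 2.5 this lands on the 152,192 entries mentioned above.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=128)
print(model.get_input_embeddings().weight.shape[0])
```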
Need to add more details to this issue:
- Goal: Be able to handle any arbitrary language
- Current Ichigo approach (WhisperVQ) is only trained for 7 languages
- We cannot find an ASR or STT module for every new language
- Our approach: Text to Semantic (same semantic space as Ichigo speech embeddings)
Please help me to align nomenclature etc. @tikikun's diagram above is very helpful.
I moved the table to the top for better visualization. cc @bachvudinh
This task is a hybrid between text-to-speech and speech-to-speech translation. It is quite hard because there is a one-to-many mapping between input text and possible output token combinations.
Here are two papers that are using the same AR setting, but for slightly different tasks. I think it can be adapted.
AudioPALM: https://arxiv.org/pdf/2306.12925
VALL-E: https://arxiv.org/pdf/2301.02111
Specifically, I think we can use VALL-E's idea of a phoneme conversion layer before sending the text into the AR model; this might bridge the gap to the semantic embeddings a bit, making the AR model's job easier. We also need to somehow provide some auxiliary information about the expected acoustic ground truth; otherwise, if we provide text only to the AR model, there are too many possible correct answers, so across multiple samples the loss may conflict.
However, I think it will be hard to make this work. The AR model needs a better constraint.
My proposal
In the WhisperSpeech framework, the text-to-semantic model is the inverse of the whisper decoder. We need to involve the whisper decoder in the training.
- Keep the same AR model structure
- However, instead of trying to get the model to predict the WhisperVQ codes, send continuous embeddings into the frozen Whisper decoder. What we are trying to do is get the AR decoder model to trick the Whisper decoder into thinking it is seeing output from the Whisper encoder.
- Compute the loss of the whisper decoder output to the original text.
You will meet a practical challenge: while training this AR decoder model, it is acting like a NAR encoder model to the Whisper decoder. There might be a smart way to solve this, but I can't think of one at the moment, or you can just use a NAR model.
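A rough sketch of what this training signal could look like; the modules and the decoder call signature below are placeholders rather than the real Whisper API, so treat it as an illustration of the idea, not a working integration:

```python
# Conceptual sketch of the frozen-decoder proposal: the t2s model produces
# continuous "pseudo encoder" states, a frozen Whisper decoder reads them via
# cross-attention, and the loss is cross-entropy against the original text.
# All modules and the decoder call signature are placeholders, not the real Whisper API.
import torch
import torch.nn as nn


class TextToEmbedding(nn.Module):
    """Maps text tokens to continuous states meant to mimic Whisper encoder output."""

    def __init__(self, vocab_size: int, d_model: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.proj = nn.Linear(d_model, d_model)  # project into the encoder space

    def forward(self, text_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.backbone(self.embed(text_ids)))


def frozen_decoder_loss(t2s, whisper_decoder, text_ids, transcript_ids):
    """Cross-entropy of the frozen decoder's predictions against the transcript."""
    for p in whisper_decoder.parameters():
        p.requires_grad_(False)  # decoder stays frozen; gradients still flow into t2s
    pseudo_states = t2s(text_ids)  # (batch, time, d_model)
    # Placeholder call: assume the decoder returns logits over its text vocab
    # given the target tokens plus cross-attention states.
    logits = whisper_decoder(transcript_ids, pseudo_states)  # (batch, len, vocab)
    return nn.functional.cross_entropy(logits.transpose(1, 2), transcript_ids)
```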
Another (Simpler) Idea
If we really want an AR model trained using next-token prediction, must use WhisperVQ tokens in the current format, and don't want to add auxiliary information, we can try a simple intervention of grouping identical WhisperVQ tokens together. This way, the model is not penalized for getting the output length wrong.
i.e. this original example:
<|sound_start|><|sound_0209|><|sound_0134|><|sound_0134|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0329|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0487|><|sound_0459|><|sound_0459|><|sound_0405|><|sound_end|>
get mapped to this:
<|sound_start|><|sound_0209|><|sound_0134|><|sound_0241|><|sound_0222|><|sound_0239|><|sound_0197|><|sound_0115|><|sound_0409|><|sound_0196|><|sound_0235|><|sound_0487|><|sound_0459|><|sound_0405|><|sound_end|>
This way the order of the token output matters, but the number of consecutively repeated tokens does not.
We can worry about upsampling the number of tokens as a separate problem. It might not matter to the decoder because the whole token sequence gets cross-attention anyway; the repeated tokens might not be adding that much information. During fine-tuning we can apply a similar filtering to the WhisperVQ token stream to see if the performance changes. If repeated tokens don't impact performance, then it actually makes inference even faster.
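The grouping described above is just a collapse of consecutive duplicate tokens; a minimal sketch in Python:

```python
# Collapse consecutive repeated sound tokens, as proposed above.
from itertools import groupby

def dedup_consecutive(tokens: list[str]) -> list[str]:
    """Keep one token per run of consecutive duplicates, preserving order."""
    return [tok for tok, _run in groupby(tokens)]

seq = ["<|sound_0209|>", "<|sound_0134|>", "<|sound_0134|>", "<|sound_0134|>",
       "<|sound_0241|>", "<|sound_0196|>", "<|sound_0196|>"]
print(dedup_consecutive(seq))
# ['<|sound_0209|>', '<|sound_0134|>', '<|sound_0241|>', '<|sound_0196|>']
```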
Updated from Research sync 2024-12-04:
- The current model has a repetition problem: it generates endless sound semantic tokens and does not know how to generate the <|sound_end|> token to stop.
- In the latest on-going runs we scaled the data to 3.7M samples, changed the model to the llama3.1 1B base version, and trained for 3 epochs. The loss stabilizes between 1.4-1.5 after the second epoch and shows no further improvement during the third epoch.
- @PodsAreAllYouNeed suggested that we should deduplicate sound tokens when making the training data. The repeated tokens might not be adding that much information, and their count is proportional to how fast a word is pronounced. I will set up the data, and as soon as the exp-t2s-llama3.2-1B-2 experiment finishes, I will start the training on the 6xA6000.
cc @PodsAreAllYouNeed @tikikun
I did some testing with WhisperVQ to see whether the number of repeated semantic tokens is proportional to how fast or slowly a word is pronounced:
Audio Content | Audio Length | Speech Rate | Semantic Token Output | Tokens Length |
---|---|---|---|---|
Hello World | 1s | Fast | [207, 207, 207, 207, 207, 143, 153, 153, 13, 13, 13, 137, 228, 228, 1, 407, 1, 1, 222, 1, 1, 1, 1, 1, 11, 11, 207, 356, 348, 207, 356, 386, 207, 130] | 34 |
Hello World | 3s | Slow | [314, 336, 500, 207, 207, 153, 13, 208, 13, 13, 137, 228, 137, 228, 228, 228, 322, 378, 407, 407, 407, 378, 200, 200, 407, 378, 200, 378, 200, 200, 407, 200, 407, 78, 407, 378, 407, 378, 200, 407, 200, 200, 192, 407, 407, 407, 192, 192, 397, 397, 397, 479, 341, 479, 245, 245, 245, 245, 245, 35, 56, 245, 400, 376, 446, 378, 400, 192, 192, 400, 200, 200, 213, 207, 200, 207, 200, 207, 207, 207, 200, 508, 508, 508, 508, 207, 336, 348, 336, 508, 336, 508, 200] | 93 |
Hello | 1s | Fast | [207, 207, 207, 207, 143, 22, 207, 153, 179, 13, 13, 137, 228, 228, 322, 434, 434, 286, 286, 286, 286, 356, 207, 22, 382, 207] | 26 |
Hello | 1s | Slow | [207, 200, 153, 153, 153, 135, 13, 313, 313, 313, 313, 378, 378, 378, 378, 378, 378, 378, 378, 13, 392, 392, 228, 228, 378, 228, 275, 275, 275, 378, 275, 200, 227, 275, 200, 426, 407, 200, 434, 382, 382, 407, 382, 207, 207, 508] | 46 |
What I tried to do:
- Deduplicate the sound semantic tokens to get a shorter sound token sequence --> efficient training. For example:
Input: <|sound_start|><|sound_0207|><|sound_0148|><|sound_0000|><|sound_0249|><|sound_0249|><|sound_0177|><|sound_0177|><|sound_0177|><|sound_0177|><|sound_0177|><|sound_0302|><|sound_0302|><|sound_0426|><|sound_0426|><|sound_0081|><|sound_0090|><|sound_0491|><|sound_0327|><|sound_0327|><|sound_0153|><|sound_0061|><|sound_0061|><|sound_0196|><|sound_0178|><|sound_0129|><|sound_0129|><|sound_0025|><|sound_0070|><|sound_0114|><|sound_0159|><|sound_0159|><|sound_0159|><|sound_0010|><|sound_0316|><|sound_0270|><|sound_0510|><|sound_0409|><|sound_0339|><|sound_0378|><|sound_0339|><|sound_0378|><|sound_0407|><|sound_0407|><|sound_0404|><|sound_0404|><|sound_0235|><|sound_0322|><|sound_0322|><|sound_0076|><|sound_0245|><|sound_0226|><|sound_0011|><|sound_0407|><|sound_0407|><|sound_0007|><|sound_0007|><|sound_0007|><|sound_0187|><|sound_0316|><|sound_0300|><|sound_0316|><|sound_0316|><|sound_0109|><|sound_0008|><|sound_0300|><|sound_0179|><|sound_0179|><|sound_0179|><|sound_0122|><|sound_0122|><|sound_0158|><|sound_0158|><|sound_0245|><|sound_0080|><|sound_0080|><|sound_0461|><|sound_0461|><|sound_0005|><|sound_0368|><|sound_0368|><|sound_0487|><|sound_0206|><|sound_0278|><|sound_0278|><|sound_0002|><|sound_0002|><|sound_0102|><|sound_0329|><|sound_0444|><|sound_0114|><|sound_0114|><|sound_0325|><|sound_0202|><|sound_0202|><|sound_0202|><|sound_0363|><|sound_0319|><|sound_0161|><|sound_0147|><|sound_0397|><|sound_0325|><|sound_0081|><|sound_0081|><|sound_0319|><|sound_0161|><|sound_0331|><|sound_0187|><|sound_0350|><|sound_0210|><|sound_0048|><|sound_0468|><|sound_0468|><|sound_0125|><|sound_0271|><|sound_0271|><|sound_0157|><|sound_0344|><|sound_0007|><|sound_0300|><|sound_0300|><|sound_0300|><|sound_0300|><|sound_0300|><|sound_0076|><|sound_0226|><|sound_0498|><|sound_0498|><|sound_0330|><|sound_0468|><|sound_0468|><|sound_0468|><|sound_0434|><|sound_0182|><|sound_end|>
Output:
<|sound_start|><|sound_0207|><|sound_0148|><|sound_0000|><|sound_0249|><|sound_0177|><|sound_0302|><|sound_0426|><|sound_0081|><|sound_0090|><|sound_0491|><|sound_0327|><|sound_0153|><|sound_0061|><|sound_0196|><|sound_0178|><|sound_0129|><|sound_0025|><|sound_0070|><|sound_0114|><|sound_0159|><|sound_0010|><|sound_0316|><|sound_0270|><|sound_0510|><|sound_0409|><|sound_0339|><|sound_0378|><|sound_0339|><|sound_0378|><|sound_0407|><|sound_0404|><|sound_0235|><|sound_0322|><|sound_0076|><|sound_0245|><|sound_0226|><|sound_0011|><|sound_0407|><|sound_0007|><|sound_0187|><|sound_0316|><|sound_0300|><|sound_0316|><|sound_0109|><|sound_0008|><|sound_0300|><|sound_0179|><|sound_0122|><|sound_0158|><|sound_0245|><|sound_0080|><|sound_0461|><|sound_0005|><|sound_0368|><|sound_0487|><|sound_0206|><|sound_0278|><|sound_0002|><|sound_0102|><|sound_0329|><|sound_0444|><|sound_0114|><|sound_0325|><|sound_0202|><|sound_0363|><|sound_0319|><|sound_0161|><|sound_0147|><|sound_0397|><|sound_0325|><|sound_0081|><|sound_0319|><|sound_0161|><|sound_0331|><|sound_0187|><|sound_0350|><|sound_0210|><|sound_0048|><|sound_0468|><|sound_0125|><|sound_0271|><|sound_0157|><|sound_0344|><|sound_0007|><|sound_0300|><|sound_0076|><|sound_0226|><|sound_0498|><|sound_0330|><|sound_0468|><|sound_0434|><|sound_0182|><|sound_end|>
- I experimented with Llama 3.1 1B using deduplicated data for 3 epochs.
  Result:
  - The loss did not go lower than 1.7 after 2 epochs (~1.73-1.8).
  - The model is unable to identify and learn the data pattern, possibly due to information loss caused by deduplication.
Idea: Add duration tokens
Observations:
- Loss did not converge as well after we de-dup; something must be lost
- Repetition is the only difference: how many times a sound token is repeated
- Continuous embeddings are a by-product of how the Whisper encoder processes audio (each embedding is slightly different from the others even though they are almost the same)
Theories:
- We need to be able to add back the "repetition" information
- Once we make sure the information is kept the same, the loss behaviour should be relatively the same
Implementation:
- We create <|dur_xxx|> tokens (where xxx is an integer) to represent the repetition count inside each group of repeated sound tokens
Extra Information
Why might some words result in repetition?
At first glance it's tempting to think that the information in a repeated token (or embedding) might be redundant. But if we take a closer look at long and short vowels in English, it might not be the case.
Example:
- Sheep - i
- Ship - ɪ
Sometimes the only way to discern the difference between sheep and ship in spoken English is whether the i sound is long or short, i.e. its duration.
By de-duplicating, everything becomes a short sound, but the training target is still either sheep or ship (long or short), which makes it impossible to really converge.
Hence, duration is the information that is left out when you de-dup.
Related to the idea of the token-level duration token, we could potentially have a "global duration token" added as a context token either before or after the provided text input. This "global duration token" tells the t2s model how many semantic tokens it needs to generate. After training, this global duration token can also be used to control the length of the generation, which controls the speaking speed. This is inspired by the "number of frames" mechanic found in the F5-TTS generation code: https://github.com/SWivid/F5-TTS/blob/8898d05e374bcb8d3fc0b1286037e95df61f491f/src/f5_tts/infer/utils_infer.py#L449C1-L452C96
If TTS models need some global duration information in order to do the generation, then our text2semantic should also use the same kind of global information. We just need to encode it a little differently.
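Purely as an illustration (bucket size and token naming are assumptions), the global duration token could be derived by bucketing the target semantic-token count and prepending it to the text prompt:

```python
# Illustrative global duration token: bucket the expected number of sound tokens
# and prepend it to the text prompt (bucket size and token name are assumptions).
def global_duration_token(num_sound_tokens: int, bucket: int = 25) -> str:
    """Map a target semantic-token count to a coarse global duration token."""
    return f"<|gdur_{(num_sound_tokens // bucket) * bucket:04d}|>"

def build_prompt(text: str, num_sound_tokens: int) -> str:
    return f"<|text_to_semantic|>{global_duration_token(num_sound_tokens)}{text}"

print(build_prompt("hello world", 93))
# <|text_to_semantic|><|gdur_0075|>hello world
```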
This idea needs further validation on different generation lengths.
@bachvudinh please add validation on longer sequence
- Testing the quality of text-to-semantic models using prompts of varying lengths:

Prompt length | Prompt | Prompt sound token length | Ichigo v0.4 Answer |
---|---|---|---|
30 | Can you please tell a story about a cat who loves to play with a ball of yarn? | 136 | What a delightful request! Here's a tale about a feline friend who loves to play with a ball of yarn: Whiskers and the Yarn Ball In a cozy little house on a quiet street, there lived a sleek and curious cat named Whiskers. Whiskers was a ball of energy, always looking for ways to entertain herself. One day, while exploring the house, she stumbled upon a ball of yarn that had been left on the floor. The soft, fluffy yarn was like a treasure trove of wonder to Whiskers. She batted it around the room, chasing after it with glee. From that moment on, Whiskers was hooked. The yarn ball became her favorite toy, and she would spend hours playing with it every day. She'd chase it around the house, pouncing on it like a predator, and then gently batting it back and forth with her paw. The yarn ball was the perfect size, just big enough to be challenging but small enough to be manageable. As the days went by, Whiskers' love for the yarn ball only grew stronger. She'd carry it around in her mouth, proudly displaying her prize to anyone who would look. Her owners, the Smiths, would laugh at her antics, watching as she'd chase the yarn ball around the living room, leaping over furniture and dodging around chairs. But Whiskers' favorite game was "Yarn Ball Hide-and-Seek." She'd hide the yarn ball under a cushion, behind a bookshelf, or even in the laundry basket, and then use her keen senses to track it down. When she finally found it, she'd let out a triumphant meow, as if to say, "I've got it! I've got it!" |
78 | Summarize below paragraph: If we really want an AR model trained using next token prediction, we must use Whisper tokens in the current format and we don't want to add auxiliary information, we can try a simple intervention of grouping identical Whisper tokens together. This way, the model is not penalized for getting the output length wrong. | 685 | The text describes a method for training an AI model using next token prediction. The model is trained using a specific format of word tokens and does not require any auxiliary information. The model can be trained using a simple technique of grouping identical word tokens together. This approach does not penalize the model for the output length on the token prediction. |
82 | Imagine you are a renowned architect designing a sustainable skyscraper in Singapore. The building needs to incorporate green technology, vertical gardens, and innovative energy solutions. Describe your vision for this groundbreaking structure, including specific details about materials, renewable energy systems, and how it will benefit the local community. Consider aspects like natural lighting, rainwater harvesting, and biodiversity integration. | 2013 | The Singapore SkyScape is a futuristic skyscraper that embodies the principles of green technology, vertical gardens, and innovative energy solutions. The building's design is a masterpiece of modern architecture, with a sleek and aerodynamic shape that maximizes natural light and ventilation. The exterior of the building is clad in a unique, translucent material that allows for maximum light penetration while also providing excellent insulation and energy efficiency. The SkyScape's vertical gardens are a key feature of the building's design. The gardens are located on the exterior of the building, and are designed to provide a natural habitat for a variety of plant species. The gardens are irrigated using a state-of-the-art hydroponic system, which uses recycled water and minimizes water waste. The gardens also provide a natural air filtration system, helping to purify the air and improve the overall air quality in the surrounding area. The SkyScape's innovative energy solutions are another key feature of the building's design. The building is powered by a combination of solar panels, wind turbines, and geothermal energy. The solar panels are located on the roof of the building, and provide a significant portion of the building's energy needs. The wind turbines are located on the exterior of the building, and provide additional energy. |
145 | Summarize the below paragraph: The importance of effective communication in everyday life cannot be overstated. Clear and concise communication helps to build strong relationships, avoid misunderstandings, and convey ideas with precision. When individuals communicate effectively, they are able to express their thoughts, needs, and desires in a way that is easily understood by others. Good communication involves not only speaking, but also listening actively and responding thoughtfully. It requires being aware of nonverbal cues, such as body language and tone of voice, and adapting one's message to the audience and context. By doing so, individuals can navigate complex social situations, resolve conflicts, and achieve their goals more efficiently. | 2015 | Effective communication is crucial in everyday life as it helps build strong relationships, avoid misunderstandings, and convey ideas with precision. Clear and concise communication is key to achieving this, and it is essential to communicate effectively in every part of the country. While computers can communicate with each other, they are limited to simple social situations and cannot resolve conflicts with humans. Therefore, it is important to communicate effectively with humans to avoid conflicts and achieve our goals. |
144 | Repeat what i said below: The importance of effective communication in everyday life cannot be overstated. Clear and concise communication helps to build strong relationships, avoid misunderstandings, and convey ideas with precision. When individuals communicate effectively, they are able to express their thoughts, needs, and desires in a way that is easily understood by others. Good communication involves not only speaking, but also listening actively and responding thoughtfully. It requires being aware of nonverbal cues, such as body language and tone of voice, and adapting one's message to the audience and context. By doing so, individuals can navigate complex social situations, resolve conflicts, and achieve their goals more efficiently. | 814 | The importance of effective communication in everyday life cannot be overstated. Clear and concise communication helps to build strong relationships, avoid misunderstandings, and convey ideas with precision. When individuals communicate effectively and respond thoughtfully, they complete the communication process. The employee's information has been completed. The communication has been completed. The communication process has been completed. |
- Result from exp-t2s-llama3.2-base-1B-compress on the LibriSpeech clean test set:
  - [Detail WER Result of Llama3.2-instruct-1B-compress]
  - Detail WER Result of Llama3.2-base-1B-compress

    Model | WER (%) |
    ---|---|
    llama3.2-instruct-3B-raw-v2 | 300.52 |
    llama3.2-instruct-1B-compress | 131.75 |
    llama3.2-base-1B-compress | 101.03 |
    WhisperVQ 7 lang | 10.86 |
    WhisperVQ English+Polish | 5.81 |

- AudioBench result of Llama3.2-base-1B-compress:
  - Open-hermes: 3.52. Detailed results can be found here: openhermes_audio_test_gpt4_judge_score.json.
  - Alpaca: 3.52. Detailed results can be found here: alpaca_audio_test_gpt4_judge_score.json.
- Updated result for llama3.1-3B-raw-semantic token (epoch 2):
  - LibriSpeech: 300.52%. stats_1.csv
  - AudioBench:
    - Openhermes: 3.56. openhermes_audio_test_gpt4_judge_score_T2S.json
    - Alpaca: 3.55. alpaca_audio_test_gpt4_judge_score_T2S.json
- Result for llama3.1-3B-raw-semantic token (3 epochs):
  - LibriSpeech:
  - AudioBench:
    - Openhermes: 3.41.
    - Alpaca: 3.49.
A good point of reference for the results for the text-to-semantic model would be the F5-TTS model.
We can also use their LibriSpeech-PC test-clean methodology for testing. Also, we can send the text into F5-TTS to generate audio, then take the generated audio and put it back through the full WhisperVQ+decoder model for ASR, as an additional point of reference. (This only needs to be done once each time we retrain the VQ.)
Looking at the table from F5-TTS, it seems that a good target for the WER of the t2s -> decoder round-trip would be 2-3% on the LibriSpeech test-clean dataset.
Text-to-Semantics Training Issue Resolution
- Issue Identified: Initially suspected a failure in Text-to-Semantics model training.
- Root Cause Discovery: During a 2-hour pair-coding session, identified that the training data contained a terminal period (".") that was missing from the evaluation data.
- Solution: Add the terminal period during evaluation, which successfully restored T2S model functionality.
- Key Learnings:
  - Separator tokens are critical components in decoder model performance, reinforcing concepts from today's brown bag presentation.
  - Small details in data formatting can have significant impacts on research outcomes.
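As a small guard against this class of mismatch, the evaluation text can be normalized to match the training format (a sketch; the exact prompt format is an assumption):

```python
# Guard against the missing terminal period found during debugging: training
# transcripts ended with ".", so evaluation inputs should too.
def normalize_eval_text(text: str) -> str:
    text = text.strip()
    return text if text.endswith(".") else text + "."

assert normalize_eval_text("hello world") == "hello world."
```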
Word Error Rate (WER) comparison between real semantic tokens and synthetic tokens
- Detailed results:
Model | WER (%) |
---|---|
Synthetic-T2S-1B-compress-with-prompt | 7.37 |
WhisperVQ 7 lang with prompt | 5.33 |
Synthetic-T2S-1B-compress | 11.52 |
WhisperVQ 7 lang | 10.86 |
Note: "With prompt" refers to adding a prompt to the Whisper decoder.