Training code
Hi
I noticed in a previous issue that you mentioned planning to release training code "in the near future". I'm trying to decide whether to reproduce training from scratch or wait for the official code. I want to train (or fine-tune?) this model on Indonesian/English and possibly Spanish/English too.
Even a rough estimate (next month vs later this year) would be really helpful for planning and preventing double work.
Thanks!
Hi! Sorry for the late response. We've just released Moshi fine-tuning code: https://github.com/kyutai-labs/moshi-finetune
This should allow you to train Hibiki as well, since it's Moshi-based. However, Hibiki-specific data preparation code is not included. The timeline on that is certainly more "later this year" than "next month", I'm afraid.
Hopefully the data preparation code will be released soon.
Can you at least show us the dataset format you think would work well? It would be easier to start from there.
The Hibiki paper describes the data preparation in detail: https://arxiv.org/pdf/2502.03382
Please see Section 4.2 Training protocol
Hello Team Moshi and Hibiki @LaurentMazare @vvolhejn
Below is a cleaned-up summary of my current situation and what I need help with:
How I'm Preparing My Dataset and Pipeline
- Single JSONL file (`data.jsonl`):
  `{"path": "/path/Speech-To-Speech-Translation/data/eng_audio/STEREO_common_voice_sw_30623601.mp3.wav", "duration": 19.076}`
  Each line points to one stereo wav file where channel 1 is Swahili (source) and channel 2 is English (target).
- Alignments for the target (English) audio:
  `{"alignments": [["He", [4.2, 5.66], "SPEAKER_MAIN"], ["loved", [5.66, 5.9], "SPEAKER_MAIN"], ..., ["kind.", [8.48, 8.78], "SPEAKER_MAIN"]]}`
  These timestamps describe the target (English) words in channel 2 (a small sketch of how I write these files follows this list).
- My goal is for the target audio to start only after the source has finished talking (like consecutive translation).
- Interleaver config: I changed `keep_main_only=True` to `keep_main_only=False`, hoping it would include both channels or avoid dropping certain alignments.
- Loss calculation: I'm still using `mb_loss = text_loss + audio_loss`, expecting that the text tokens might help improve translation quality.
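For concreteness, here's a minimal sketch of how I write the manifest and alignment files (soundfile is just what I use to read durations; the paths and the placement of the alignment JSON next to the wav are my own convention, not something moshi-finetune prescribes):

```python
import json
import soundfile as sf  # only used here to read the duration

stereo_path = "/path/Speech-To-Speech-Translation/data/eng_audio/STEREO_common_voice_sw_30623601.mp3.wav"

# Append one JSON object per stereo file to data.jsonl: path + duration in seconds.
with open("/path/data/data.jsonl", "a") as manifest:
    duration = sf.info(stereo_path).duration
    manifest.write(json.dumps({"path": stereo_path, "duration": round(duration, 3)}) + "\n")

# Word-level alignments for the target (English) channel, saved as a JSON file
# next to the wav (this placement is my own choice).
alignments = {
    "alignments": [
        ["He", [4.2, 5.66], "SPEAKER_MAIN"],
        ["loved", [5.66, 5.9], "SPEAKER_MAIN"],
        ["kind.", [8.48, 8.78], "SPEAKER_MAIN"],
    ]
}
with open(stereo_path.replace(".wav", ".json"), "w") as f:
    json.dump(alignments, f)
```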
Train.py and Config
- Train script: I didn't rewrite much beyond switching to `keep_main_only=False`.
- YAML config snippet:

```yaml
data:
  train_data: '/path/data/data.jsonl'
  shuffle: true
moshi_paths:
  hf_repo_id: "kyutai/hibiki-1b-pytorch-bf16"
full_finetuning: false
lora:
  enable: true
  rank: 512
  scaling: 2.0
first_codebook_weight_multiplier: 100.0
text_padding_weight: 0.5
duration_sec: 20
batch_size: 4
max_steps: 250
gradient_checkpointing: true
optim:
  lr: 2.e-4
  weight_decay: 0.1
  pct_start: 0.05
run_dir: "/path/Speech-To-Speech-Translation/test"
```

- I'm using partial fine-tuning with LoRA, a fairly large learning rate (2e-4), a short max sequence length (20 s), etc.
Core Issue
After training, my model doesn’t produce any audible output at inference time (it “doesn’t speak”). I’m trying to figure out why. The stereo wave is set up so:
- Channel 1 (source): Swahili, no text alignment.
- Channel 2 (target): English, with `SPEAKER_MAIN` alignments.
I assume the source chunk ends, and then the target chunk should come in. But I’m not getting any voiced output.
- I might be missing something in how I split the stereo wav into separate channels.
- Maybe `keep_main_only=False` isn't enough to keep the relevant alignments.
- Possibly the large learning rate or the short sequence context is an issue.
- Or maybe my inference (generation) pipeline is incomplete.
Any guidance on making the model produce the target speech is appreciated!
Hi!
As your goal is for the target audio to start only after the source has finished talking, you need to ensure that the training data you use also satisfies this property. In particular, you have to add silence before the target audio to shift it into the future with respect to the source when building the stereo wavs. You also need to shift the timestamps of the alignment accordingly.
The max sequence length should be set to roughly twice the max length of the original source/target audio samples from your data.
Note that kyutai/hibiki-1b-pytorch-bf16 was trained on fr->en streaming translation only, so it might require some effort to adapt it to another source language (like Swahili) as well as another latency setup (offline translation in your case). You should maybe consider alternative approaches like cascaded ASR->MT->TTS systems if latency is not your main constraint.
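A rough sketch of that silence-padding step (numpy and soundfile are just example libraries here, the file names are placeholders, and the channel order follows the moshi-finetune README convention of left = model output, right = user input):

```python
import numpy as np
import soundfile as sf

# Build a stereo training wav where the target translation only starts
# after the source has finished, by left-padding the target with silence.
source, sr = sf.read("source_swahili.wav")    # mono source waveform
target, sr_t = sf.read("target_english.wav")  # mono target waveform
assert sr == sr_t, "source and target must share a sample rate"

delay_sec = len(source) / sr + 0.5            # silence until the source ends, plus a margin
pad = np.zeros(int(round(delay_sec * sr)), dtype=source.dtype)
target_shifted = np.concatenate([pad, target])

# Pad both channels to the same length.
n = max(len(source), len(target_shifted))
source = np.pad(source, (0, n - len(source)))
target_shifted = np.pad(target_shifted, (0, n - len(target_shifted)))

# Left channel = audio the model should generate (target), right channel = user input (source).
stereo = np.stack([target_shifted, source], axis=1)
sf.write("stereo_pair.wav", stereo, sr)

# Remember to shift every [start, end] in the target alignments by the same delay_sec.
```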
Thanks for the quick reply!
I'm facing latency issues with my ASR→MT→TTS pipeline. For the dataset I have prepared, I've tried adding 0.5 × SR samples of silence (i.e., 0.5 s) between source and target, and I'm curious why you'd suggest making the max sequence length about twice as long. I've already adjusted the alignment timestamps accordingly.
Also, did you mean I should fully pre-train the language model from scratch or adapt my existing one to the new scenario?
I understood that you were trying to perform sentence-level translation (start translating a sentence once it's over) and dealing with single-sentence audios; that's why I advised using twice the max input length as the generation sequence length, so the model has some time to produce its translation after the end of the input speech.
Now I understand that you add a constant delay of 0.5 s between source and target. That might not be sufficient depending on the linguistic complexity of your data (see Table 4 in our paper https://arxiv.org/pdf/2502.03382 for delay ablations).
Adapting the model to new languages or latency setups is still an open research area, so I don't have a definitive answer, but fine-tuning is likely your best option if you don't have enough data for pretraining.
Hi!
For #13 (comment):
Regarding the dataset, the repo https://github.com/kyutai-labs/moshi-finetune/tree/main?tab=readme-ov-file#-prepare-dataset says:
The pipeline expects a dataset of stereo audio files, the left channel is used for the audio generated by moshi, whereas the second channel is used for the user's input.
@msamwelmollel, could you check whether the channels in your files are swapped? You mentioned that:
Channel 1 (source): Swahili, no text alignment.
Channel 2 (target): English, with SPEAKER_MAIN alignments.
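For a quick check of the channel order, something like this works (illustrative only, assuming the wavs can be read with soundfile; the file names are placeholders):

```python
import soundfile as sf

# moshi-finetune expects: left channel (index 0) = audio the model should generate
# (the target translation), right channel (index 1) = the user's input (source).
data, sr = sf.read("stereo_pair.wav")  # shape: (frames, 2)
left, right = data[:, 0], data[:, 1]
# Listen to left/right (or plot them) to verify which language is on which channel.

# If the target ended up on the right instead of the left, swap the channels and rewrite.
sf.write("stereo_pair_swapped.wav", data[:, ::-1], sr)
```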
I'm facing similar issues with Hibiki training.
For data preprocessing, the output audio is delayed by 2 seconds.
For training, I fine-tune the whole model.
After fine-tuning, the model spits out random text with no correlation to the input audio. On the other hand, the output audio is correctly delayed from the text by 2 seconds and matches it perfectly. Nevertheless, it seems like the model ignores the input audio completely. Any thoughts? I'd much appreciate any help.
Hello again!
After a couple of trials, I got the code running for Estonian-to-English translation. Here are some details:
- I used sentence-level alignment with a 1-second delay between the input and output audio
- Make sure to have an EOS token at the end of the output text (otherwise the model will keep generating nonsense)
- Make sure that the left channel is used for the audio generated by moshi, whereas the second channel is used for the user's input.
- I performed full fine-tuning on a single A100 GPU
All in all, if you don't miss the tiny details from the paper and the repo, the translation works well.
Hello @daniilrobnikov, could you share your repo with me? I would really appreciate that!
My email: msamwelmollel@gmail.com
I can share my repo so you can point out where I went wrong. Please reply, @daniilrobnikov!
Hi, if I wanted to fine-tune Hibiki for English -> Mandarin translation, could you share roughly how much data is needed and what a sample dataset would look like?
Could you also explain how you introduced the 1-second delay?
When you create the channels, I guess you add a 1-second delay between left and right; that's my understanding at least. But I wish he could share his repo, since I tried all of this and couldn't get even a basic response out of the model.
@msamwelmollel, I will post the repo within a month or so
@jasonngap1: the right ("input") channel is unchanged; the left ("output") channel is delayed by the duration of the input audio plus 1 second. The same delay is applied to the "start" and "end" fields of the alignments in the JSON files.
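In code, that recipe looks roughly like this (illustrative sketch; the file names are placeholders and the JSON layout follows the alignment format shown earlier in this thread):

```python
import json
import soundfile as sf

# Delay = duration of the input (right-channel, source) audio plus 1 second.
delay = sf.info("input_source.wav").duration + 1.0

# Shift every word's [start, end] in the target alignments by that delay.
with open("target_alignments.json") as f:
    data = json.load(f)

data["alignments"] = [
    [word, [start + delay, end + delay], speaker]
    for word, (start, end), speaker in data["alignments"]
]

with open("target_alignments_shifted.json", "w") as f:
    json.dump(data, f)

# The left (output) channel audio itself must be delayed by the same amount,
# e.g. by prepending int(round(delay * sample_rate)) zero samples.
```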
Thanks! Please help me out when I get stuck; I want to try what you've recommended!
Got it, will see if I can try this out soon
Thanks a lot 🙏🏻 this will be helpful 👍🏻
It's been one month now! Can you at least share the code privately?
Any plan to open source or share the code?
Any luck with the repo?