Training code
Hi
I noticed in a previous issue that you mentioned planning to release training code "in the near future". I'm trying to decide whether to reproduce training from scratch or wait for the official code. I want to train (or fine-tune?) this model on Indonesian/English and possibly Spanish/English too.
Even a rough estimate (next month vs later this year) would be really helpful for planning and preventing double work.
Thanks!
Hi! Sorry for the late response. We've just released Moshi fine-tuning code: https://github.com/kyutai-labs/moshi-finetune
This should allow you to train Hibiki as well, since it's Moshi-based. However, Hibiki-specific data preparation code is not included. The timeline on that is certainly more "later this year" than "next month", I'm afraid.
Hopefully the data preparation code will be released soon.
Can you at least show us the dataset format you think would work well? It would be easier to start from there.
The Hibiki paper describes the data preparation in detail: https://arxiv.org/pdf/2502.03382
Please see Section 4.2 Training protocol
Hello Team Moshi and Hibiki @LaurentMazare @vvolhejn
Below is a cleaned-up summary of my current situation and what I need help with:
How I'm Preparing My Dataset and Pipeline
- Single JSONL file (`data.jsonl`):
  `{"path": "/path/Speech-To-Speech-Translation/data/eng_audio/STEREO_common_voice_sw_30623601.mp3.wav", "duration": 19.076}`
  Each line points to one stereo wav file where channel 1 is Swahili (source) and channel 2 is English (target).
- Alignments for the target (English) audio:
  `{"alignments": [["He", [4.2, 5.66], "SPEAKER_MAIN"], ["loved", [5.66, 5.9], "SPEAKER_MAIN"], ..., ["kind.", [8.48, 8.78], "SPEAKER_MAIN"]]}`
  These timestamps describe the target (English) words in channel 2 (a small sketch of how I write these files follows this list).
- My goal is for the target audio to start only after the source has finished talking (like consecutive translation).
- Interleaver config: I changed `keep_main_only=True` to `keep_main_only=False`, hoping it would include both channels or avoid dropping certain alignments.
- Loss calculation: I'm still using `mb_loss = text_loss + audio_loss`, expecting that the text tokens might help improve translation quality.
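For concreteness, here's a minimal sketch of how I write the manifest and alignment files (soundfile is just what I use to read durations; the paths and the placement of the alignment JSON next to the wav are my own convention, not something moshi-finetune prescribes):

```python
import json
import soundfile as sf  # only used here to read the duration

stereo_path = "/path/Speech-To-Speech-Translation/data/eng_audio/STEREO_common_voice_sw_30623601.mp3.wav"

# Append one JSON object per stereo file to data.jsonl: path + duration in seconds.
with open("/path/data/data.jsonl", "a") as manifest:
    duration = sf.info(stereo_path).duration
    manifest.write(json.dumps({"path": stereo_path, "duration": round(duration, 3)}) + "\n")

# Word-level alignments for the target (English) channel, saved as a JSON file
# next to the wav (this placement is my own choice).
alignments = {
    "alignments": [
        ["He", [4.2, 5.66], "SPEAKER_MAIN"],
        ["loved", [5.66, 5.9], "SPEAKER_MAIN"],
        ["kind.", [8.48, 8.78], "SPEAKER_MAIN"],
    ]
}
with open(stereo_path.replace(".wav", ".json"), "w") as f:
    json.dump(alignments, f)
```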
Train.py and Config
- Train script: I didn't rewrite much beyond switching to `keep_main_only=False`.
- YAML config snippet:

```yaml
data:
  train_data: '/path/data/data.jsonl'
  shuffle: true
moshi_paths:
  hf_repo_id: "kyutai/hibiki-1b-pytorch-bf16"
full_finetuning: false
lora:
  enable: true
  rank: 512
  scaling: 2.0
first_codebook_weight_multiplier: 100.0
text_padding_weight: 0.5
duration_sec: 20
batch_size: 4
max_steps: 250
gradient_checkpointing: true
optim:
  lr: 2.e-4
  weight_decay: 0.1
  pct_start: 0.05
run_dir: "/path/Speech-To-Speech-Translation/test"
```

- I'm using partial fine-tuning with LoRA, a fairly large learning rate (2e-4), a short max sequence length (20 s), etc.
Core Issue
After training, my model doesn’t produce any audible output at inference time (it “doesn’t speak”). I’m trying to figure out why. The stereo wave is set up so:
- Channel 1 (source): Swahili, no text alignment.
- Channel 2 (target): English, with `SPEAKER_MAIN` alignments.
I assume the source chunk ends, and then the target chunk should come in. But I’m not getting any voiced output.
- I might be missing something in how I split the stereo wav into separate channels.
- Maybe `keep_main_only=False` isn't enough to keep the relevant alignments.
- Possibly the large learning rate or the short sequence context is an issue.
- Or maybe my inference (generation) pipeline is incomplete.
Any guidance on making the model produce the target speech is appreciated!
Hi!
As your goal is for the target audio to start only after the source has finished talking, you need to ensure that the training data you use also satisfies this property. In particular, you have to add silence before the target audio to shift it into the future with respect to the source when building the stereo wavs. You also need to shift the timestamps of the alignment accordingly.
The max sequence length should be set to roughly twice the max length of the original source/target audio samples from your data.
Note that kyutai/hibiki-1b-pytorch-bf16 was trained on fr->en streaming translation only, so it might require some effort to adapt it to another source language (like Swahili) as well as another latency setup (offline translation in your case). You should maybe consider alternative approaches like cascaded ASR->MT->TTS systems if latency is not your main constraint.
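A rough sketch of that silence-padding step (numpy and soundfile are just example libraries here, the file names are placeholders, and the channel order follows the moshi-finetune README convention of left = model output, right = user input):

```python
import numpy as np
import soundfile as sf

# Build a stereo training wav where the target translation only starts
# after the source has finished, by left-padding the target with silence.
source, sr = sf.read("source_swahili.wav")    # mono source waveform
target, sr_t = sf.read("target_english.wav")  # mono target waveform
assert sr == sr_t, "source and target must share a sample rate"

delay_sec = len(source) / sr + 0.5            # silence until the source ends, plus a margin
pad = np.zeros(int(round(delay_sec * sr)), dtype=source.dtype)
target_shifted = np.concatenate([pad, target])

# Pad both channels to the same length.
n = max(len(source), len(target_shifted))
source = np.pad(source, (0, n - len(source)))
target_shifted = np.pad(target_shifted, (0, n - len(target_shifted)))

# Left channel = audio the model should generate (target), right channel = user input (source).
stereo = np.stack([target_shifted, source], axis=1)
sf.write("stereo_pair.wav", stereo, sr)

# Remember to shift every [start, end] in the target alignments by the same delay_sec.
```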
Thanks for the quick reply!
I'm facing latency issues with my ASR→MT→TTS pipeline. For the dataset I have prepared, I've tried adding 0.5 × SR samples of silence (i.e., 0.5 s) between source and target, and I'm curious why you'd suggest making the max sequence length about twice as long. I've already adjusted the alignment timestamps accordingly.
Also, did you mean I should fully pre-train the language model from scratch or adapt my existing one to the new scenario?
I understood that you were trying to perform sentence-level translation (start translating a sentence once it's over) and dealing with single-sentence audios; that's why I advised using twice the max input length as the generation sequence length, so the model has some time to produce its translation after the end of the input speech.
Now I understand that you add a constant delay of 0.5 s between source and target. That might not be sufficient depending on the linguistic complexity of your data (see Table 4 in our paper https://arxiv.org/pdf/2502.03382 for delay ablations).
Adapting the model to new languages or latency setups is still an open research area, so I don't have a definitive answer, but fine-tuning is likely your best option if you don't have enough data for pretraining.
Hi!
For #13 (comment):
Regarding the dataset, the repo https://github.com/kyutai-labs/moshi-finetune/tree/main?tab=readme-ov-file#-prepare-dataset says:
The pipeline expects a dataset of stereo audio files, the left channel is used for the audio generated by moshi, whereas the second channel is used for the user's input.
@msamwelmollel, could you check whether the channels in your files are swapped? You mentioned that:
Channel 1 (source): Swahili, no text alignment.
Channel 2 (target): English, with SPEAKER_MAIN alignments.
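For a quick check of the channel order, something like this works (illustrative only, assuming the wavs can be read with soundfile; the file names are placeholders):

```python
import soundfile as sf

# moshi-finetune expects: left channel (index 0) = audio the model should generate
# (the target translation), right channel (index 1) = the user's input (source).
data, sr = sf.read("stereo_pair.wav")  # shape: (frames, 2)
left, right = data[:, 0], data[:, 1]
# Listen to left/right (or plot them) to verify which language is on which channel.

# If the target ended up on the right instead of the left, swap the channels and rewrite.
sf.write("stereo_pair_swapped.wav", data[:, ::-1], sr)
```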
I'm facing similar issues with Hibiki training.
For data preprocessing, the output audio is delayed by 2 seconds.
For training, I fine-tune the whole model.
After fine-tuning, the model spits out random text with no correlation to the input audio. On the other hand, the output audio is correctly delayed from the text by 2 seconds and matches it perfectly. Nevertheless, it seems like the model ignores the input audio completely. Any thoughts? I'd much appreciate any help.
Hello again!
After a couple of trials, I got the code running for Estonian-to-English translation. Here are some details:
- I used sentence-level alignment with a 1-second delay between the input and output audio
- Make sure to have an EOS token at the end of the output text (otherwise the model will keep generating nonsense)
- Make sure that the left channel is used for the audio generated by moshi, whereas the second channel is used for the user's input.
- I performed full fine-tuning on a single A100 GPU
All in all, if you don't miss the tiny details from the paper and the repo, the translation works well.
Hello @daniilrobnikov, could you share your repo with me? I would really appreciate that!
My email: msamwelmollel@gmail.com
I can share my repo so you can point out where I went wrong. Please reply, @daniilrobnikov!
Hi, if I wanted to fine-tune Hibiki for English -> Mandarin translation, could you share roughly how much data is needed and what a sample dataset would look like?
Could you also explain how you introduced the 1-second delay?
When you create the channels, I guess you add a 1-second delay between left and right; that's my understanding at least. But I wish he could share his repo, since I tried all of this and couldn't get even a basic response out of the model.
@msamwelmollel, I will post the repo within a month or so
@jasonngap1: the right ("input") channel is unchanged; the left ("output") channel is delayed by the duration of the input audio plus 1 second. The same delay is applied to the "start" and "end" fields of the alignments in the JSON files.
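In code, that recipe looks roughly like this (illustrative sketch; the file names are placeholders and the JSON layout follows the alignment format shown earlier in this thread):

```python
import json
import soundfile as sf

# Delay = duration of the input (right-channel, source) audio plus 1 second.
delay = sf.info("input_source.wav").duration + 1.0

# Shift every word's [start, end] in the target alignments by that delay.
with open("target_alignments.json") as f:
    data = json.load(f)

data["alignments"] = [
    [word, [start + delay, end + delay], speaker]
    for word, (start, end), speaker in data["alignments"]
]

with open("target_alignments_shifted.json", "w") as f:
    json.dump(data, f)

# The left (output) channel audio itself must be delayed by the same amount,
# e.g. by prepending int(round(delay * sample_rate)) zero samples.
```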
Thanks! Please help me out when I get stuck; I want to try what you've recommended!
Got it, will see if I can try this out soon
Thanks a lot 🙏🏻 this will be helpful 👍🏻
It's been one month now! Can you at least share the code privately?
Any plan to open source or share the code?
Any luck with the repo?