OpenMOSS/AnyGPT

Question about training stage and dataset


If my understanding is correct, you trained the model in two stages:

  1. Pretraining, which used the data listed in Table 7 of the paper.
  2. Fine-tuning with instruction data.
    However, one detail confuses me: is the released base model trained only through the pretraining stage? If so, how can the model handle the TTS task, which was not part of that training process?

Another question: I noticed there is a lot of code related to the audio modality, so I assume your team had already prepared the relevant data and generated instructions for it. Why was it removed in the end? Would it hurt performance on speech- or music-related tasks, or was there some other reason?

Thanks, and I look forward to your response.

Regarding the first question: TTS is actually included in the pre-training tasks, which comprise seven types in total: text-to-image, image captioning, TTS, ASR, text-to-music, music captioning, and image-text interleaved data.
Regarding the second question: that code is not part of the main pipeline and does not affect performance.
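
Concretely, each of these tasks is just a different ordering of text tokens and discrete modality tokens in one stream. A minimal sketch, assuming illustrative boundary markers (`<sosp>`, `<eosp>`, `<soi>`, `<eoi>`); these are not the exact tokens used in the repo:

```python
# Minimal sketch: every pre-training task becomes one token stream for a
# decoder-only LM. Boundary markers and the task set here are
# illustrative assumptions, not AnyGPT's exact tokens.

def build_sequence(task: str, text: list[str], modality: list[str]) -> list[str]:
    """Order text and discrete modality tokens by task direction."""
    if task == "tts":            # text -> speech
        return text + ["<sosp>"] + modality + ["<eosp>"]
    if task == "asr":            # speech -> text
        return ["<sosp>"] + modality + ["<eosp>"] + text
    if task == "text-to-image":  # text -> image
        return text + ["<soi>"] + modality + ["<eoi>"]
    if task == "image-caption":  # image -> text
        return ["<soi>"] + modality + ["<eoi>"] + text
    raise ValueError(f"unknown task: {task}")

# Example: a TTS sample interleaving text with speech codec tokens.
print(build_sequence("tts", ["hello", "world"], ["<spt_17>", "<spt_4>"]))
```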

Thanks for your answer; I would like to discuss the first question further. As we know, the codec hurts ASR performance. Do you think the next-token-prediction loss on ASR codec tokens helps improve TTS results, or yields some emergent speech abilities? What about only adding the codec tokens to the vocabulary so they are output exclusively in the TTS task, while for ASR the downsampled speech-encoder features are used as the LLM input?
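
To make the proposal concrete, a minimal PyTorch sketch, with all module names and sizes as illustrative assumptions: codec tokens are appended to the vocabulary and appear only as TTS targets, while the ASR input is the downsampled continuous speech-encoder feature projected into the LLM embedding space:

```python
import torch
import torch.nn as nn

class HybridSpeechLM(nn.Module):
    """Sketch of the hybrid scheme: codec ids live only in the output
    vocabulary (TTS targets); ASR input is continuous speech features
    projected into the embedding space. Names/sizes are assumptions."""

    def __init__(self, text_vocab=32000, codec_vocab=1024,
                 d_model=512, speech_feat_dim=1280):
        super().__init__()
        total_vocab = text_vocab + codec_vocab   # codec ids appended
        self.embed = nn.Embedding(total_vocab, d_model)
        # projects downsampled continuous encoder features for ASR input
        self.speech_proj = nn.Linear(speech_feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the LLM
        self.lm_head = nn.Linear(d_model, total_vocab)

    def forward_asr(self, speech_feats, prompt_ids):
        # speech_feats: (B, T, speech_feat_dim) continuous features;
        # trained to emit only text ids here.
        h = torch.cat([self.speech_proj(speech_feats),
                       self.embed(prompt_ids)], dim=1)
        return self.lm_head(self.backbone(h))

    def forward_tts(self, text_ids):
        # text in, codec-token ids out (the only place codec ids appear)
        return self.lm_head(self.backbone(self.embed(text_ids)))

model = HybridSpeechLM()
asr_logits = model.forward_asr(torch.randn(1, 50, 1280),
                               torch.randint(0, 32000, (1, 4)))
```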

We believe that using a unified representation is better and has greater potential. However, the approach you describe can also work; see, for example, LauraGPT (https://arxiv.org/abs/2310.04673).
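
For contrast with the hybrid sketch above, under the same illustrative assumptions, the unified representation keeps speech discrete on both sides, so ASR input and TTS output share one embedding table and one LM head:

```python
# Unified-representation sketch: codec ids are offset into the region
# appended after the text vocabulary, so the same embedding table and
# LM head serve ASR input and TTS output. Sizes are assumptions.
import torch
import torch.nn as nn

TEXT_VOCAB, CODEC_VOCAB, D_MODEL = 32000, 1024, 512
embed = nn.Embedding(TEXT_VOCAB + CODEC_VOCAB, D_MODEL)
lm_head = nn.Linear(D_MODEL, TEXT_VOCAB + CODEC_VOCAB)

def codec_to_vocab_ids(codec_ids: torch.Tensor) -> torch.Tensor:
    """Map raw codec ids into the shared vocabulary's codec region."""
    return codec_ids + TEXT_VOCAB

# ASR input and TTS targets both pass through the same tables:
speech = codec_to_vocab_ids(torch.randint(0, CODEC_VOCAB, (1, 8)))
h = embed(speech)       # speech enters as discrete tokens
logits = lm_head(h)     # and the same ids can be predicted as outputs
```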