gpt-omni/mini-omni

Question regarding data format and loss calculation in stage 1

Opened this issue · 7 comments

In stage 1, only ASR and TTS are used.

ASR is Audio -> Text, so the loss is only calculated for text tokens, not for audio tokens, right?
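(For reference, a minimal PyTorch sketch of what text-only loss masking can look like; this is an assumption for illustration, not mini-omni's actual code. Positions labeled -100 are ignored by cross_entropy, and label shifting for next-token prediction is omitted.)

```python
import torch
import torch.nn.functional as F

# Hypothetical ASR masking sketch: only text positions carry loss.
def asr_labels(input_ids, is_text_token):
    labels = input_ids.clone()
    labels[~is_text_token] = -100  # audio (and pad) positions are ignored
    return labels

logits = torch.randn(1, 6, 32000)            # (batch, seq, vocab), dummy
input_ids = torch.randint(0, 32000, (1, 6))  # dummy token ids
is_text = torch.tensor([[False, False, False, True, True, True]])
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),
    asr_labels(input_ids, is_text).view(-1),
    ignore_index=-100,
)
```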

TTS is Text -> Audio, but mini-omni outputs text and audio simultaneously. I'm not sure how to format the input data for TTS.

Input: text tokens
Output: text tokens + audio tokens

When training TTS, are both text tokens and audio tokens fed into the LM, with the loss calculated only for the audio tokens? Or does TTS not use text at all (only pad tokens)?

Only the audio-token loss is calculated when training TTS.
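(Continuing the masking sketch above, a hypothetical TTS version; mini-omni actually decodes parallel text and audio streams, so this single-sequence form is purely illustrative.)

```python
# Hypothetical TTS masking sketch: only audio positions carry loss;
# text and pad positions are labeled -100 and ignored by cross_entropy.
def tts_labels(input_ids, is_audio_token):
    labels = input_ids.clone()
    labels[~is_audio_token] = -100
    return labels
```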

@superFilicos
Are text tokens fed into the transformer during TTS training? And are audio tokens fed into the transformer during ASR training?

We train these two tasks separately. I think you can also train them at the same time, since they train different modules of the model.
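(A rough sketch of what training different modules can look like; the module names below are assumptions for illustration, not mini-omni's actual attributes.)

```python
import torch.nn as nn

# Toy stand-in: freeze everything, then unfreeze only the module the
# current stage targets. Names here are hypothetical, not mini-omni's.
class ToyOmni(nn.Module):
    def __init__(self):
        super().__init__()
        self.lm = nn.Linear(8, 8)             # stands in for the frozen LLM
        self.audio_adapter = nn.Linear(8, 8)  # e.g. trained for ASR
        self.audio_head = nn.Linear(8, 8)     # e.g. trained for TTS

def train_only(model, submodule):
    for p in model.parameters():
        p.requires_grad = False
    for p in submodule.parameters():
        p.requires_grad = True

model = ToyOmni()
train_only(model, model.audio_adapter)  # ASR stage
# train_only(model, model.audio_head)   # TTS stage
```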

@superFilicos How much data did you use to train the TTS adapter, and how many speakers? For the TTS adapter task I also computed the loss only on the audio tokens.
I trained on multi-speaker data and added a speaker embedding as an input to the LLM, but the outputs have a lot of repeated and dropped words. My experiments were on Chinese; did you run into these problems as well?
Thanks!

Hello, our training output has only one voice, and the audio data was synthesized with an internal industrial-grade model, so it is certainly more stable. I suspect your data has a relatively high bad-case rate. We have not tried Chinese.

vra commented

Hi @superFilicos, thanks for the answers. Could the following points be clarified further in the paper, so that people reproducing the work don't have to keep asking in the issues:

  1. How are the audio tokens and text tokens fed into each of the 3 training stages obtained, and are they used when computing the loss?
  2. Which losses are computed in each of the 3 training stages, and how are the ground truth and predictions for each loss obtained?

@superFilicos Still confused. What I want to ask is the TTS sample format used for training: are the ground-truth text tokens used as input, or is the text stream filled with text pad tokens?

<audio1> <audio2> ... <audio-n>
<text1> <text2> ... <text-pad>

or

<audio1> <audio2> ... <audio-n>
<text-pad> <text-pad> ... <text-pad>

?
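(For concreteness, a hypothetical sketch of how the two candidate layouts above could be built; the token ids and TEXT_PAD value are made up, and this does not claim which layout mini-omni actually uses.)

```python
TEXT_PAD = 0  # hypothetical pad id

def layout_with_gt_text(audio_ids, text_ids):
    # Option 1: ground-truth text tokens, padded out to the audio length.
    text = text_ids + [TEXT_PAD] * (len(audio_ids) - len(text_ids))
    return audio_ids, text

def layout_all_pad(audio_ids):
    # Option 2: the text stream is entirely pad tokens.
    return audio_ids, [TEXT_PAD] * len(audio_ids)

audio, text = [101, 102, 103, 104], [7, 8]
print(layout_with_gt_text(audio, text))  # ([101, 102, 103, 104], [7, 8, 0, 0])
print(layout_all_pad(audio))             # ([101, 102, 103, 104], [0, 0, 0, 0])
```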