bytedance/SALMONN

Some questions about this project, hoping for your answers.

Closed this issue · 2 comments

Thank you very much for open-sourcing the complete code of SALMONN.
Although the paper and the code are clearly presented, a few points still puzzle me after reading both:

  1. What is the difference between your Stage 1 and Stage 2 training settings for the ASR task?
    The paper says you use LibriSpeech-960h (280k) and the GigaSpeech M-set at Stage 1, and then use the same LibriSpeech-960h (also 280k) at Stage 2. What changes in the training setting on LibriSpeech from Stage 1 to Stage 2? Did you train without any instructions during Stage 1, or did you just change the instructions used?

  2. How did you obtain the 200k GigaSpeech samples used at Stage 2?
    I notice that the GigaSpeech subset used at Stage 2 is 200k, close to the size of the GigaSpeech S-set (220k), while according to the paper you seem to have used the entire GigaSpeech M-set (680k) during Stage 1. What exactly are these 200k GigaSpeech samples at Stage 2? Were they randomly selected from the M-set?

  3. Is performance on downstream tasks affected by having so many preset instructions for instruction tuning?
    According to the recently released code, many instructions are defined for each downstream task (for instance, 15 instructions for the ASR task). From my point of view, one problem that is hard to avoid is that instructions for different downstream tasks may share similar patterns, and these similarities could mislead the model into performing another, unintended task during inference, especially with a lower beam setting. So I would like to know whether, in your opinion or from your experiments, more instructions for tuning are better or fewer are better, because I am unsure which case better prevents this kind of confusion.
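
To make point 3 concrete, the kind of per-task prompt pool I have in mind is roughly the following (a hypothetical sketch with made-up task names and prompt wordings, not the actual SALMONN configuration):

```python
import random

# Hypothetical per-task instruction pools; the released code defines many more
# prompts per task (e.g. 15 for ASR). Note how the ASR and AST wordings overlap,
# which is the kind of similarity I am worried about.
TASK_PROMPTS = {
    "asr": [
        "Please transcribe the speech into written text.",
        "Recognize the speech and write down what is said.",
        "Listen to the audio and give the transcription.",
    ],
    "ast": [
        "Please translate the speech into German text.",
        "Recognize the speech and translate it into German.",
    ],
    "aac": [
        "Please describe the sounds in this audio clip.",
        "Write a caption for this audio.",
    ],
}

def build_prompt(task: str) -> str:
    """Randomly pick one of the preset instructions for the given task."""
    return random.choice(TASK_PROMPTS[task])

if __name__ == "__main__":
    random.seed(0)
    for task in TASK_PROMPTS:
        print(task, "->", build_prompt(task))
```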

I could not find any information about these questions in either the paper or the code, so I am looking forward to your answers.
Thank you again for taking the time to read my issue. I hope for an early reply!

Thanks for your interest in our work! As for your questions, I'll answer them below.

  1. Actually, there is not much difference between Stage 1 and Stage 2 settings. It's just that in Stage 1 we focus only on the ASR and AAC tasks and train with a lot of data (regardless of data quality and ratio). In both Stage 1 and Stage 2, we use the same instructions for ASR and AAC.
  2. The 200k samples of GigaSpeech are randomly selected from the M-set (a rough sketch of this kind of subsampling is shown after this list). The reason for reducing the ASR data is to alleviate task overfitting: we found that once there is too much ASR data, task overfitting becomes very serious.
  3. I would say that more instructions for tuning are better, because we want the LLM to learn the subtle differences between the instructions of different tasks. In other words, we want the LLM to follow our instructions well.
    Ideally, we can assume that text-based LLMs fully understand user prompts, since they have seen a wide variety of instructions during training. For multimodal LLMs such as SALMONN, it is difficult to cover a comparable variety of multimodal interaction scenarios during training, so the prompts easily become bound to a limited number of tasks and the model's ability to follow instructions is inhibited. This is why the model may misunderstand the user's prompt when facing unseen prompts or tasks. However, if we could provide as many instructions when training SALMONN as are used to train text LLMs, covering a wide variety of multimodal scenarios, the model would recover the ability to understand subtle differences between prompts, as text LLMs do. That would certainly be a daunting task, though.
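
To illustrate points 2 and 3 above, here is a rough sketch of how such a Stage 2 mixture could be assembled (the identifiers, the size of the non-ASR data, and the script itself are illustrative only, not our actual data pipeline):

```python
import random

# Illustrative utterance IDs standing in for manifest entries; the LibriSpeech
# and GigaSpeech M-set counts follow the paper, the rest is made up.
gigaspeech_m = [f"giga_m_{i}" for i in range(680_000)]   # GigaSpeech M-set (~680k)
librispeech = [f"ls_{i}" for i in range(280_000)]        # LibriSpeech-960h (~280k)
other_tasks = [f"other_{i}" for i in range(400_000)]     # AAC and other tasks (made-up count)

random.seed(0)

# Randomly subsample 200k utterances from the M-set so that ASR data does not
# dominate the Stage 2 mixture and aggravate task overfitting.
gigaspeech_200k = random.sample(gigaspeech_m, 200_000)

stage2_mixture = gigaspeech_200k + librispeech + other_tasks
random.shuffle(stage2_mixture)

asr_share = (len(gigaspeech_200k) + len(librispeech)) / len(stage2_mixture)
print(f"Stage 2 examples: {len(stage2_mixture):,}  |  ASR share: {asr_share:.1%}")
```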

I hope my answers resolve your confusion. If you still have questions, feel free to ask.

Thanks for your detailed response to my questions. I have no more questions now, so I'll close the issue.