declare-lab/MM-Align

why stop at bi-modality?

DiTo97 opened this issue · 5 comments

DiTo97 commented

Hi @Clement25,

Thanks for the interesting read. I discovered your paper while searching for information on the MELD vision-audio-language dataset for multi-modal learning and on what approaches had already been investigated.

I was wondering why you limited your research to a bi-modal setting, while the datasets (MELD and CMU-MOSI) are tri-modal. After digging into your paper, I made some assumptions, including:

  • the victim modality assumption would not hold as well in the tri-modal setting;
  • the MulT network that you used as backbone is suitable only for bi-modal sequences;
  • the bi-modal setting is simpler to model.

but I would like to hear your rationale, as the choice is not discussed in the paper.

On a side note, did you investigate the effects of alternating the victim modality during training?

If you did, was it stable? The assumption of one modality always being complete is reasonable, but a more challenging scenario would be being robust to any missing modality, using the others for alignment recovery.
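
To make my question concrete, here is a rough sketch of the kind of training loop I have in mind, where the victim is re-drawn at every step. This is purely illustrative: `model`, its keyword arguments, the `batch` layout, and the missing-rate masking are all made up by me and are not your actual interface.

```python
import random
import torch

# Purely illustrative sketch of "alternating the victim modality" per step;
# the model interface and batch layout below are hypothetical, not this repo's.
MODALITIES = ("audio", "vision")

def training_step(model, batch, optimizer, missing_rate=0.5):
    # Pick which modality plays the victim for this step, at random.
    victim = random.choice(MODALITIES)
    complete = MODALITIES[1 - MODALITIES.index(victim)]

    # Drop victim-modality timesteps with a Bernoulli mask over time.
    x_victim = batch[victim]                                   # (B, T, D)
    keep = torch.rand(x_victim.shape[:2],
                      device=x_victim.device) > missing_rate   # (B, T)
    x_victim = x_victim * keep.unsqueeze(-1)

    # Hypothetical call: the complete modality drives alignment recovery
    # for the partially missing victim modality.
    loss = model(complete=batch[complete], victim=x_victim, victim_name=victim)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```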

Clement25 commented

Hi Federico,

Thanks for your question. Your third point is correct. This work only considers bi-modal conditions for simplicity of modeling, writing, and understanding. We believe our framework can be extended to tri-modal scenarios, but the whole formulation would need to be reconstructed.

Could you elaborate more on

On a side note, did you investigate the effects of alternating the victim modality during training?

Do you refer to the missing rate or modality input?

DiTo97 commented

Could you elaborate more on

On a side note, did you investigate the effects of alternating the victim modality during training?

Do you refer to the missing rate or modality input?

Hi Wei,

Thank you for the clarification on the choice of bi-modality.

As for my question on alternating the victim modality, I was referring to the input structure. Specifically, I was imagining a real-life scenario where a microphone and a camera constantly stream samples to the model, which is asked for the current emotional state of the people interacting in the conversation. In such a scenario it is difficult to fix a priori which modality would be complete and which would play the role of the victim, as a de-sync or any other noisy perturbation could occur at any time and prevent samples from either sensor from reaching the model. Therefore, I was curious whether you had tried inverting which modality was the victim mid-training, once or more, to investigate how the model would adapt to such a scenario.

Of course, I haven't read the whole code, so I don't know whether setting the victim modality fixes the configuration of the architecture (e.g., some hidden dims), making it impossible to switch the victim mid-training. I guess this is related to what you mention in the paper about the model being able to handle either fully missing or fully complete sequences, without being able to react if a sequence had only some of the potential victim-modality samples.

tl;dr: have you experimented with training the model on (or are you aware of any model that has been trained on) sequences like:

$$
\begin{aligned}
M_1 &= m_{1,1}, m_{1,2}, \ldots, m_{1,n} \\
M_2 &= m_{2,1}, m_{2,2}, \ldots, m_{2,n}
\end{aligned}
$$

where $M_1$ and $M_2$ are two aligned sequences of the two modalities of choice, and any sample $m_{i,j}$ at time step $j$ could be missing at random, following some known or unknown probability distribution, without a fixed victim modality?
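
For illustration, a toy generator of the presence masks I am describing could look like the following. This is only a sketch of the scenario, not anything from your code, and the missing probability and sequence length are arbitrary.

```python
import numpy as np

def sample_presence_masks(n_steps, p_missing=0.3, seed=None):
    """Toy generator for the scenario above: two aligned sequences M1, M2
    where each sample m_{i,j} is independently missing with probability
    p_missing, with no fixed victim modality (numbers are arbitrary)."""
    rng = np.random.default_rng(seed)
    present = rng.random((2, n_steps)) >= p_missing  # True = sample observed
    return present

present = sample_presence_masks(n_steps=10, p_missing=0.3, seed=0)
print("M1 observed:", present[0].astype(int))
print("M2 observed:", present[1].astype(int))
```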

DiTo97 commented

Hi @Clement25. Any news?

DiTo97 commented

Thanks for the insights, @Clement25!

I am closing the issue.