declare-lab/MM-Align

why stop at bi-modality?

DiTo97 opened this issue · 5 comments

DiTo97 commented

Hi @Clement25,

Thanks for the interesting read. I discovered your paper while searching for information on the MELD vision-audio-language dataset for multi-modal learning and on what approaches had already been investigated.

I was wondering why you limited your research to a bi-modal setting, while the datasets (MELD and CMU-MOSI) are tri-modal. After digging into your paper, I made some assumptions, including:

  • the victim modality assumption would not hold as well in the tri-modal setting;
  • the MulT network that you used as backbone is suitable only for bi-modal sequences;
  • the bi-modal setting is simpler to model.

but I would like to hear your rationale, as the choice is not discussed in the paper.

On a side note, did you investigate the effects of alternating the victim modality during training?

If you did, was it stable? The assumption of one modality always being complete is reasonable, but a more challenging scenario would be being robust to any missing modality, using the others for alignment recovery.
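
To make my question concrete, here is a rough sketch of the kind of training loop I have in mind, where the victim is re-drawn at every step. This is purely illustrative: `model`, its keyword arguments, the `batch` layout, and the missing-rate masking are all made up by me and are not your actual interface.

```python
import random
import torch

# Purely illustrative sketch of "alternating the victim modality" per step;
# the model interface and batch layout below are hypothetical, not this repo's.
MODALITIES = ("audio", "vision")

def training_step(model, batch, optimizer, missing_rate=0.5):
    # Pick which modality plays the victim for this step, at random.
    victim = random.choice(MODALITIES)
    complete = MODALITIES[1 - MODALITIES.index(victim)]

    # Drop victim-modality timesteps with a Bernoulli mask over time.
    x_victim = batch[victim]                                   # (B, T, D)
    keep = torch.rand(x_victim.shape[:2],
                      device=x_victim.device) > missing_rate   # (B, T)
    x_victim = x_victim * keep.unsqueeze(-1)

    # Hypothetical call: the complete modality drives alignment recovery
    # for the partially missing victim modality.
    loss = model(complete=batch[complete], victim=x_victim, victim_name=victim)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```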

Clement25 commented

Hi Federico,

Thanks for your question. Your third point is correct. This work only considers bi-modal conditions for simplicity of modeling, writing, and understanding. We believe our framework can be extended to tri-modal scenarios, but the whole formulation would need to be reconstructed.

Could you elaborate more on

On a side note, did you investigate the effects of alternating the victim modality during training?

Do you refer to the missing rate or modality input?

DiTo97 commented

Could you elaborate more on

On a side note, did you investigate the effects of alternating the victim modality during training?

Do you refer to the missing rate or modality input?

Hi Wei,

Thank you for the clarification on the choice of bi-modality.

As for my question on alternating the victim modality, I was referring to the input structure. Specifically, I was imagining a real-life scenario where a microphone and a camera constantly stream samples to the model, which is asked for the current emotional state of the people interacting in the conversation. In such a scenario it is difficult to fix a priori which modality would be complete and which would play the role of the victim, as a de-sync or any other noisy perturbation could occur at any time and prevent samples from either sensor from reaching the model. Therefore, I was curious whether you had tried inverting which modality was the victim mid-training, once or more, to investigate how the model would adapt to such a scenario.

Of course, I haven't read the whole code, so I don't know whether setting the victim modality fixes the configuration of the architecture (e.g., some hidden dims), making it impossible to switch the victim mid-training. I guess this is related to what you mention in the paper about the model being able to handle either fully missing or fully complete sequences, without being able to react if a sequence had only some of the potential victim-modality samples.

tl;dr: have you experimented with training the model on (or are you aware of any model that has been trained on) sequences like:

$$
\begin{aligned}
M_1 &= m_{1,1}, m_{1,2}, \ldots, m_{1,n} \\
M_2 &= m_{2,1}, m_{2,2}, \ldots, m_{2,n}
\end{aligned}
$$

where $M_1$ and $M_2$ are two aligned sequences of the two modalities of choice, and any sample $m_{i,j}$ at time step $j$ could be missing at random, following some known or unknown probability distribution, without a fixed victim modality?
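
For illustration, a toy generator of the presence masks I am describing could look like the following. This is only a sketch of the scenario, not anything from your code, and the missing probability and sequence length are arbitrary.

```python
import numpy as np

def sample_presence_masks(n_steps, p_missing=0.3, seed=None):
    """Toy generator for the scenario above: two aligned sequences M1, M2
    where each sample m_{i,j} is independently missing with probability
    p_missing, with no fixed victim modality (numbers are arbitrary)."""
    rng = np.random.default_rng(seed)
    present = rng.random((2, n_steps)) >= p_missing  # True = sample observed
    return present

present = sample_presence_masks(n_steps=10, p_missing=0.3, seed=0)
print("M1 observed:", present[0].astype(int))
print("M2 observed:", present[1].astype(int))
```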

DiTo97 commented

Hi @Clement25. Any news?

DiTo97 commented

Thanks for the insights, @Clement25!

I am closing the issue.