facebookresearch/multimodal

CoCa model implementation

Opened this issue · 1 comment

🚀 The feature, motivation and pitch

Thank you for your awesome work!
I have some questions about the CoCa model implementation.

In */multimodal/torchmultimodal/models/coca/coca_model.py, it seems we can choose between using CascadedAttentionPooler or just a single AttentionPooler.
However, when using CascadedAttentionPooler, the dimensions do not match at the second pooler.

For example, the vision feature extracted by the VisionEncoder has shape (B, h*w, dim).
It then passes through the vision pooler (pooled_outputs = self.vision_pooler(image_embeddings)), and when using CascadedAttentionPooler, self.vision_pooler consists of 2 sequential AttentionPooler layers.
After the first AttentionPooler layer, the feature has shape (B, 256, q_dim), but the LayerNorm in the second pooler expects dim, not q_dim.
Is it okay if I simply modify the input dimension of the second AttentionPooler layer?
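
To make the shape mismatch concrete, here is a minimal, self-contained sketch. It is not the actual torchmultimodal code: ToyAttentionPooler, its argument names, and the example sizes (dim=768, q_dim=512) are assumptions for illustration only.

```python
import torch
from torch import nn


class ToyAttentionPooler(nn.Module):
    """Simplified stand-in for an attention pooler: learned queries cross-attend
    over the input tokens, so the output always has the query dimension."""

    def __init__(self, input_dim: int, query_dim: int, n_queries: int = 256, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, query_dim))
        # LayerNorm over the *input* dim: this is where the mismatch shows up
        # if the second pooler is built with `dim` instead of `q_dim`.
        self.ln = nn.LayerNorm(input_dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=query_dim, kdim=input_dim, vdim=input_dim,
            num_heads=n_heads, batch_first=True,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ln(x)                              # (B, seq, input_dim)
        q = self.queries.expand(x.size(0), -1, -1)  # (B, n_queries, query_dim)
        out, _ = self.attn(q, x, x)                 # (B, n_queries, query_dim)
        return out


B, hw, dim, q_dim = 2, 1024, 768, 512
image_embeddings = torch.randn(B, hw, dim)

pooler_1 = ToyAttentionPooler(input_dim=dim, query_dim=q_dim)
pooler_2_buggy = ToyAttentionPooler(input_dim=dim, query_dim=q_dim)    # expects dim, receives q_dim
pooler_2_fixed = ToyAttentionPooler(input_dim=q_dim, query_dim=q_dim)

stage_1 = pooler_1(image_embeddings)   # (B, 256, q_dim)
# pooler_2_buggy(stage_1) raises a shape error, since its LayerNorm/attention expect dim.
stage_2 = pooler_2_fixed(stage_1)      # (B, 256, q_dim)
print(stage_1.shape, stage_2.shape)
```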

Similarly, when using 'vision_cls_token' with CascadedAttentionPooler, the vision feature has shape (B, h*w + 1(cls), dim) (e.g., (B, 1025, 768)).
The vision pooler then returns learnable tokens after cross-attention with the vision feature, with shape (B, 256, q_dim) for each of captioning_image_embeddings and contrastive_image_embeddings.
If you did not intend to use the visual features directly, is it necessary to add the 'cls_token' at the initial stage?
In other words, what is the purpose of prepending the 'cls_token' to the visual features if it is never used directly?
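
Continuing the toy sketch above (same assumed shapes), prepending a CLS token only lengthens the key/value sequence that the learned queries attend over; the pooled output is still just the queries, so the CLS embedding is never returned directly:

```python
cls_token = torch.randn(B, 1, dim)
with_cls = torch.cat([cls_token, image_embeddings], dim=1)  # (B, h*w + 1, dim), e.g. (2, 1025, 768)
pooled = pooler_1(with_cls)                                 # still (B, 256, q_dim); CLS is only attended over
print(with_cls.shape, pooled.shape)
```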

Thank you again!

Hi @seungkyuK, thanks for creating the issue! Sorry for the delayed reply; I missed this one over the holidays.

You are right about (1): we need to change input_embed_dim for the contrastive pooler to match the output dim from the captioning pooler. I will open up a quick PR to fix this.
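
For reference, the intended wiring would look roughly like the sketch below. It assumes an AttentionPooler constructor taking input_embed_dim / output_embed_dim / n_head / n_queries and a CascadedAttentionPooler that wraps a list of poolers; treat the import path and exact signatures as assumptions and double-check them against the code.

```python
from torchmultimodal.modules.layers.attention_pooler import (
    AttentionPooler,
    CascadedAttentionPooler,
)

vision_dim = 768   # output dim of the vision encoder
pooler_dim = 768   # output dim of the captioning pooler

# First pooler consumes the raw vision features.
captioning_pooler = AttentionPooler(
    input_embed_dim=vision_dim,
    output_embed_dim=pooler_dim,
    n_head=8,
    n_queries=256,
)
# Second pooler consumes the *output* of the first one, so its input_embed_dim
# must equal the captioning pooler's output dim (this was the bug in (1)).
contrastive_pooler = AttentionPooler(
    input_embed_dim=pooler_dim,
    output_embed_dim=pooler_dim,
    n_head=8,
    n_queries=1,  # a single token is enough for the contrastive objective
)
vision_pooler = CascadedAttentionPooler([captioning_pooler, contrastive_pooler])
```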

On (2): this is an interesting case. Actually we went back and forth on whether to include the CLS token in CoCa's vision encoder at all because it is not really clear from the paper that they use it. The open_clip implementation (which we compared against in #507) does use it, but they also only used global average pooling in their original implementation. However, our read of the pseudocode in Figure 2 of the paper was that they do not use CLS. As a result you'll see that most of our models default to vision_include_cls_embed=False. So we actually didn't intend for people to set both vision_include_cls_embed and cascaded_pooler to True (though we did not do a good enough job making this clear).

If you are setting vision_include_cls_embed to True and you want to use the vision encoder's CLS token directly, you can define your own pooler (we already have the CLS pooler defined here; one option is to just concat that with the usual attention pooler for the captioning objective). The alternative is to just use the contrastive pooler to aggregate over all tokens by setting n_queries=1 here (I think we should make this change anyway; it doesn't make sense to return more than one token for the contrastive objective).
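
As a sketch of the first option (use CLS directly for the contrastive embedding and attention pooling for captioning), something like the following could work. This is hypothetical code, not an existing torchmultimodal class: the name CLSPlusAttentionPooler and the [captioning, contrastive] return format are assumptions that would need to match whatever CoCaModel expects from its vision pooler.

```python
import torch
from torch import nn


class CLSPlusAttentionPooler(nn.Module):
    """Hypothetical custom pooler: use the vision encoder's CLS token for the
    contrastive embedding and learned-query attention pooling for captioning."""

    def __init__(self, input_dim: int, output_dim: int, n_queries: int = 256, n_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, output_dim))
        self.ln = nn.LayerNorm(input_dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=output_dim, kdim=input_dim, vdim=input_dim,
            num_heads=n_heads, batch_first=True,
        )
        self.cls_proj = nn.Linear(input_dim, output_dim)

    def forward(self, x: torch.Tensor):
        # x: (B, 1 + h*w, input_dim), with the CLS token prepended at position 0
        x = self.ln(x)
        contrastive = self.cls_proj(x[:, 0])              # (B, output_dim), taken from CLS directly
        q = self.queries.expand(x.size(0), -1, -1)
        captioning, _ = self.attn(q, x[:, 1:], x[:, 1:])  # (B, n_queries, output_dim)
        return [captioning, contrastive]


pooler = CLSPlusAttentionPooler(input_dim=768, output_dim=512)
feats = torch.randn(2, 1025, 768)                         # (B, 1 + h*w, dim)
captioning_emb, contrastive_emb = pooler(feats)
print(captioning_emb.shape, contrastive_emb.shape)        # (2, 256, 512), (2, 512)
```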

Then you are correct that the CLS embedding is no longer used directly (and actually I think this is true regardless of whether we use cascaded or parallel attention poolers). One thing we could do is modify coca_vit to make it easier to define a custom pooler that uses CLS directly (similar to the one I mentioned above); otherwise it has to be done from the CoCaModel class, which is a bit more work.

For now I will at least make the change to fix (1) and set n_queries=1 in the contrastive pooler by default. Please let me know if my discussion of (2) makes sense and whether there's anything we can do to make things clearer on the attention pooler + CLS front. Update: opened #518 for this