GAIR-NLP/anole

Causal mask in attention for text and image data

Closed this issue · 1 comment

Hi, thank you for the great work!

I was just wondering: did you use a causal mask for all modalities (text and image) during training, including interleaved text and images? Or did you use bidirectional attention for images (no attention mask, like BERT)?

Yes, the default attention implementation of Chameleon is causal attention. Anole follows the same implementation.
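
For intuition, here is a minimal sketch (not the repository's actual code) of what this means in practice: since images are discretized into tokens and placed in the same sequence as text, a single standard causal mask is applied uniformly, so image tokens are restricted to earlier positions just like text tokens. All names below are illustrative.

```python
# Minimal sketch (assumption: text and image tokens share one interleaved sequence).
import torch

def causal_mask(seq_len: int, dtype=torch.float32) -> torch.Tensor:
    # Future positions (strict upper triangle) are set to -inf, so every
    # token, text or image, can only attend to itself and earlier positions.
    mask = torch.full((seq_len, seq_len), float("-inf"), dtype=dtype)
    return torch.triu(mask, diagonal=1)

# Example: 4 text tokens followed by 3 image tokens in one sequence.
# The image tokens receive exactly the same causal treatment as the text tokens.
seq_len = 7
mask = causal_mask(seq_len)
scores = torch.randn(seq_len, seq_len)        # hypothetical attention logits
attn = torch.softmax(scores + mask, dim=-1)   # future positions get zero weight
print(attn[0])                                # first token attends only to itself
```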

By the way, recent papers (1, 2) also show that bidirectional attention performs well for vision modeling.
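
For comparison, the sketch below illustrates the kind of mixed mask such approaches describe: text stays causal while tokens belonging to the same image attend to each other bidirectionally. This is purely illustrative and is not what Chameleon or Anole implement; the helper names and inputs are assumptions.

```python
# Illustrative sketch only (assumption: per-token image flags and span ids are available).
import torch

def mixed_mask(is_image: torch.Tensor, span_id: torch.Tensor) -> torch.Tensor:
    # is_image: bool tensor [seq_len], True for image tokens.
    # span_id:  int tensor  [seq_len], same id for tokens of the same image.
    seq_len = is_image.shape[0]
    allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    same_image = (
        is_image[:, None] & is_image[None, :] & (span_id[:, None] == span_id[None, :])
    )
    allowed |= same_image  # full (bidirectional) attention within each image span
    mask = torch.zeros(seq_len, seq_len)
    mask[~allowed] = float("-inf")
    return mask

# Example sequence: [text, text, img0, img0, img0, text]
is_image = torch.tensor([False, False, True, True, True, False])
span_id = torch.tensor([-1, -1, 0, 0, 0, -1])
print(mixed_mask(is_image, span_id))
```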