mlvlab/RPO

Does RPO (read-only mechanism) really work on the language branch?

Closed this issue · 3 comments

  def build_attention_mask(self):
      # lazily create causal attention mask, with full attention between the vision tokens
      # pytorch uses additive attention mask; fill with -inf
      mask = torch.empty(self.context_length, self.context_length)
      mask.fill_(float("-inf"))
      mask.triu_(1)  # zero out the lower diagonal
      return mask

The attn_mask used in CLIP already ensures that attention flow from the learnable prompts to the existing features is impossible: each token attends only to the tokens before it, never to the tokens after it. The analysis of (a) Conventional Prompt Tuning (CoCoOp), (b) Linear Probing, and (c) Read-only Prompt Optimization is therefore untenable.
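
For reference, here is a minimal sketch (not from either repository) of the mask that build_attention_mask produces for a toy context length of 5; entry [i, j] = -inf means query token i cannot attend to key token j:

    import torch

    # Causal mask, built exactly as in build_attention_mask but for 5 tokens.
    mask = torch.full((5, 5), float("-inf")).triu_(1)
    print(mask)
    # tensor([[0., -inf, -inf, -inf, -inf],
    #         [0.,   0., -inf, -inf, -inf],
    #         [0.,   0.,   0., -inf, -inf],
    #         [0.,   0.,   0.,   0., -inf],
    #         [0.,   0.,   0.,   0.,   0.]])
    # Row i is the query: token i may only attend to tokens 0..i, never to later ones.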

The annotation "lazily create causal attention mask, with full attention between the vision tokens" implies that RPO's motivating idea, keeping token features frozen and unaffected by the introduced prompts, actually only takes effect on the visual branch. The textual tokens are not affected by the introduced prompts in the first place. Is my understanding correct?

If my reasoning is correct, how can we explain the analysis and experimental results in RPO with uni-modal prompts:

Text-RPO and CoOp differ in the point that RPO’s prompts do not affect the internal representation of the pre-trained model but CoOp’s prompts do. As shown in Table 5, uni-RPO still achieves competitive performance compared to CoCoOp with a 0.8% drop compared to RPO, which again demonstrates the effectiveness of the read-only mechanism.

    def define_mask(self):
        len_max = 77
        attn_head = 8

        text_mask = torch.empty(0, len_max, len_max)
        for idx in self.len_prompts:
            mask = torch.empty(len_max, len_max)
            mask.fill_(float("-inf"))
            mask.triu_(1)  # zero out the lower diagonal
            mask[:, idx:].fill_(float("-inf"))  # read-only part: block attention to the appended prompt positions
            text_mask = torch.cat([text_mask, mask.repeat(attn_head, 1, 1)])
        self.text_mask = text_mask

        # image encoder mask
        att_size = 1 + 14 * 14 + self.cfg.TRAINER.RPO.K
        visual_mask = torch.zeros((att_size, att_size), dtype=self.dtype, requires_grad=False)
        visual_mask[:, -1 * self.cfg.TRAINER.RPO.K:] = float("-inf")  # block attention to the K visual prompt columns
        #####

        self.visual_mask = visual_mask
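
To make explicit what the visual part of this mask enforces, here is a minimal sketch (the sizes are hypothetical; only the masking pattern follows the code above): every query is blocked from attending to the K appended prompt columns, so the [CLS] and patch features are computed exactly as in frozen CLIP, while the prompts can still read them.

    import torch

    # Minimal sketch (hypothetical sizes): the visual read-only mask blocks every
    # query from attending to the K appended prompt tokens (the last K columns).
    K = 4                        # number of read-only visual prompts (assumed)
    att_size = 1 + 14 * 14 + K   # [CLS] + 14x14 patches + K prompts
    visual_mask = torch.zeros(att_size, att_size)
    visual_mask[:, -K:] = float("-inf")

    q = torch.randn(att_size, 64)
    k = torch.randn(att_size, 64)
    attn = torch.softmax(q @ k.t() / 64 ** 0.5 + visual_mask, dim=-1)
    print(attn[0, -K:].sum())             # tensor(0.): [CLS] puts zero weight on the prompts
    print(attn[-1, :att_size - K].sum())  # ~1.0: a prompt spends all its attention on the original tokens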

The define_mask method in rpo.py clearly imitates the build_attention_mask method in CLIP. Did you not notice at the time that CLIP's self-attention already uses a causal attention mask? Although a full attention mask is more common, CLIP uses a causal attention mask instead.

Only the eot_token can see all tokens in a sentence, which is why CLIP uses the eot_token feature as the final representation of the sentence. This further reinforces my point: CLIP's text encoder uses a causal attention mask.
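
A minimal sketch of that point: under the causal mask, position i can see exactly i + 1 tokens, so only the final (eot) position of the sentence sees them all.

    import torch

    # Count how many tokens each position may attend to under a causal mask.
    L = 6
    causal = torch.full((L, L), float("-inf")).triu_(1)
    print((causal == 0).sum(dim=-1))   # tensor([1, 2, 3, 4, 5, 6])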

Hi, sorry for the late response.

The textual tokens are not originally affected by introduced prompts. Is my understanding correct?
-> If you check the implementation of Context Optimization (CoOp), the input to the text branch is [prompt 1][prompt 2][prompt 3]...[prompt K][CLASS NAME]<eos>, hence [CLASS NAME] is affected by the introduced prompts: under the causal mask it attends to the prompt tokens that precede it.
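
A minimal sketch of that point, with a hypothetical 6-token layout (only the ordering matters): because the learnable prompts precede the class name, the causal mask allows the class-name and <eos> queries to attend to them.

    import torch

    # CoOp-style layout under a causal mask (hypothetical tokens).
    tokens = ["[p1]", "[p2]", "[p3]", "[p4]", "[CLASS]", "<eos>"]
    L = len(tokens)
    causal = torch.full((L, L), float("-inf")).triu_(1)
    cls_row = tokens.index("[CLASS]")
    print(causal[cls_row, :4])   # tensor([0., 0., 0., 0.]): [CLASS] may attend to every prompt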

However, in RPO, the input format is:
A photo of a [CLASSNAME] [prompt 1][prompt 2][prompt 3]...[prompt K], and the attention flow from the prompts to the text tokens is blocked by the read-only mechanism (the text tokens cannot attend to the prompts, so the prompts cannot alter their representations).
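
A minimal sketch of the RPO side (the lengths are hypothetical; the masking pattern follows define_mask above): the prompts are appended after the text and their columns are masked for every query, so the text and <eot> representations cannot be changed by the prompts, while the prompts can still read the text.

    import torch

    # Causal mask plus read-only column block (hypothetical lengths).
    n_text, K = 6, 4                                   # text tokens incl. <eot>, then K prompts
    L = n_text + K
    mask = torch.full((L, L), float("-inf")).triu_(1)  # causal part, as in CLIP
    mask[:, n_text:] = float("-inf")                   # read-only part: no query may attend to the prompts
    print(mask[n_text - 1, n_text:])  # <eot> row, prompt columns: all -inf (prompts cannot write)
    print(mask[n_text, :n_text])      # first prompt row, text columns: all 0 (prompts can read)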