wasiahmad/NeuralCodeSum

Extract the attention weights

MattePalte opened this issue · 3 comments

Hi,
first of all thanks to @wasiahmad for sharing the code and for the support provided.
I am trying to extract the attention weights from your model (just to observe them and store them in a file). At the moment I am able to get them in the from_batch function of the TranslationBuilder object, in this way:
https://github.com/wasiahmad/NeuralCodeSum/blob/master/c2nl/translator/translation.py#L49

    def from_batch(self, translation_batch, src_raw, targets, src_vocabs):
        batch_size = len(translation_batch["predictions"])
        preds = translation_batch["predictions"]
        pred_score = translation_batch["scores"]
        attn = translation_batch["attention"]

        translations = []
        attentions = [] # CHANGE
        for b in range(batch_size):
            src_vocab = src_vocabs[b] if src_vocabs else None
            pred_sents = [self._build_target_tokens(
                src_vocab, src_raw[b],
                preds[b][n], attn[b][n])
                for n in range(self.n_best)]
            translation = Translation(targets[b], pred_sents,
                                      attn[b], pred_score[b])
            translations.append(translation)
            attentions.append(attn[b]) # CHANGE

        return translations, attentions # CHANGE

Then I propagate them back to the caller until I can manipulate them in test.py and write them to a file, roughly as sketched below.
If there is a better way, I would be grateful if you could share it.
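
A minimal sketch of the write-out step (variable names simplified; I assume each entry of attentions is a list of tensors, one per n-best prediction):

    import torch

    # Collect the attentions that the modified from_batch now returns
    # and serialize them with torch.save.
    translations, attentions = builder.from_batch(
        translation_batch, src_raw, targets, src_vocabs)

    # Move everything to CPU so the file can be loaded without a GPU.
    attentions_cpu = [[a.cpu() for a in per_example]
                      for per_example in attentions]
    torch.save(attentions_cpu, "attention_weights.pt")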

QUESTION A:
Reading the paper and looking at the arguments, it seems that copy attention is involved. I wanted to know whether the weights I extract are:

  1. copy attention weights (used to copy tokens directly to the output),
  2. attention weights used during prediction, or
  3. a mix of the two. In this case I would be curious to know exactly how, with reference to the code, because I read the referenced work "Get To The Point: Summarization with Pointer-Generator Networks" (http://arxiv.org/abs/1704.04368) and they say: "We recycle the attention distribution to serve as the copy distribution". Is this the case for you too? (See the formula below.)
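
For reference, the final distribution in that paper mixes the vocabulary distribution with the recycled attention distribution through a generation probability p_gen:

    % Final distribution from See et al. (2017): p_gen gates between
    % generating a word from the vocabulary and copying source token w_i
    % via the attention distribution a^t.
    P(w) = p_{\mathrm{gen}} \, P_{\mathrm{vocab}}(w)
           + (1 - p_{\mathrm{gen}}) \sum_{i : w_i = w} a_i^t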

QUESTION B:
In case the extracted one is the copy attention, is it possible to somehow extract an attention that represents the self-attention of the transformer architecture?

Thanks in advance. I wish you a happy and productive day,

Matteo

QUESTION A

It is either 1 or 2. If you are using copy attention, then it is (1); otherwise, it is (2). You can check here.
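
In short, with copy attention enabled, the generator mixes the vocabulary softmax with the copy attention scattered over the source tokens. A simplified sketch of the idea (names are illustrative, not the exact repo code; vocab_proj and copy_gate stand for linear layers):

    import torch
    import torch.nn.functional as F

    def copy_generator_sketch(hidden, copy_attn, src_map,
                              vocab_proj, copy_gate):
        # hidden: [batch, tgt_len, dim]
        # copy_attn: [batch, tgt_len, src_len]
        # src_map: [batch, src_len, extended_vocab], one-hot source map
        vocab_prob = F.softmax(vocab_proj(hidden), dim=-1)
        # p_copy in [0, 1]: how much probability mass goes to copying.
        p_copy = torch.sigmoid(copy_gate(hidden))
        # Project the copy attention onto the extended vocabulary.
        copy_prob = torch.bmm(copy_attn, src_map)
        # Final distribution over [vocab + extended_vocab].
        return torch.cat([vocab_prob * (1 - p_copy),
                          copy_prob * p_copy], dim=-1)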

QUESTION B

Yes, you can. You just need to update the code snippet that I referenced above.
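
For example, something like this (based on the snippet in this thread; the attn dict carries both keys):

    # Keep the decoder's standard attention alongside the copy attention.
    std_attn = unbottle(attn["std"].squeeze(1))
    beam_attn = unbottle(attn["copy"].squeeze(1))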

Thanks a lot for the fast reply!

I checked, and my model has copy_attn set to true, and I was extracting beam_attn, so I was getting the copy attention, right?

I have already made the required modification to extract both attentions:

    if self.copy_attn:
        out = copy_generator.forward(dec_out, attn["copy"], src_map)
        out = out.squeeze(1)
        # beam x batch_size x tgt_vocab
        out = unbottle(out.data)
        for b in range(out.size(0)):
            for bx in range(out.size(1)):
                if blank[bx]:
                    blank_b = torch.Tensor(blank[bx]).to(code_word_rep)
                    fill_b = torch.Tensor(fill[bx]).to(code_word_rep)
                    out[b, bx].index_add_(0, fill_b,
                                          out[b, bx].index_select(0, blank_b))
                    out[b, bx].index_fill_(0, blank_b, 1e-10)
        transformer_attention = unbottle(attn["std"].squeeze(1))  # CHANGE - CONSIDER THIS TRANSFORMER "REGULAR" ATTENTION (?)
        beam_attn = unbottle(attn["copy"].squeeze(1))  # CONSIDER THIS COPY ATTENTION (?)
    else:
        out = generator.forward(dec_out.squeeze(1))
        # beam x batch_size x tgt_vocab
        out = unbottle(f.softmax(out, dim=1))
        transformer_attention = None
        beam_attn = unbottle(attn["std"].squeeze(1))

I want to confirm: can I now treat the "std" attention that I extract as the correct "regular" transformer attention, and the beam attention (which I was already extracting before) as the copy attention?

Thanks in advance.

According to my understanding, yes, you are right.