Extract the attention weights
MattePalte opened this issue · 3 comments
First of all, thanks to @wasiahmad for sharing the code and for the support provided.
I am trying to extract the attention weights from your model (just to inspect them and store them in a file). At the moment I can get them in the from_batch method of the TranslationBuilder object, like this:
def from_batch(self, translation_batch, src_raw, targets, src_vocabs):
    batch_size = len(translation_batch["predictions"])
    preds = translation_batch["predictions"]
    pred_score = translation_batch["scores"]
    attn = translation_batch["attention"]
    translations = []
    attentions = []  # CHANGE
    for b in range(batch_size):
        src_vocab = src_vocabs[b] if src_vocabs else None
        pred_sents = [self._build_target_tokens(
            src_vocab, src_raw[b],
            preds[b][n], attn[b][n])
            for n in range(self.n_best)]
        translation = Translation(targets[b], pred_sents,
                                  attn[b], pred_score[b])
        translations.append(translation)
        attentions.append(attn[b])  # CHANGE
    return translations, attentions  # CHANGE
Then I propagate them back up through the callers until I can manipulate them in test.py and write them to a file.
If there is a better way, I would be grateful if you could share it.
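For the "write them to a file" step, one possible approach is a JSON Lines file with one record per example. This is only a minimal sketch: `save_attentions` and `load_attentions` are hypothetical helpers, not part of the repository, and they assume each attention entry has already been converted to nested Python lists (for a PyTorch tensor, for instance via `.tolist()`).

```python
import json


def save_attentions(attentions, path):
    """Write one JSON record per example (hypothetical helper).

    `attentions` is assumed to be a list of nested lists of floats,
    one entry per example in the batch.
    """
    with open(path, "w", encoding="utf-8") as fh:
        for idx, attn in enumerate(attentions):
            record = {"example": idx, "attention": attn}
            fh.write(json.dumps(record) + "\n")


def load_attentions(path):
    """Read the attention matrices back in the order they were written."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line)["attention"] for line in fh if line.strip()]
```

Keeping one record per line means a long evaluation run can be inspected example by example without loading the whole file.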
From the paper and the arguments, it seems that copy attention is involved. I would like to know whether the weights I extract are:
- copy attention weights (used to copy tokens directly to the output)
- attention weights used during prediction
- a mix of the two. In that case I would like to know exactly how they are combined, ideally with reference to the code. The referenced work, "Get To The Point: Summarization with Pointer-Generator Networks" (http://arxiv.org/abs/1704.04368), says: "We recycle the attention distribution to serve as the copy distribution". Is this the case for you too?
If the extracted one is the copy attention, is it possible to extract an attention that represents the self-attention of the transformer architecture?
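For reference, the "recycling" described in the cited pointer-generator paper can be sketched as below: the final distribution mixes the vocabulary distribution with the attention over source tokens, weighted by a generation probability p_gen. This illustrates the paper's formulation only, with hypothetical names and plain Python dictionaries; whether this repository reuses one attention or keeps a separate copy attention is exactly the question above.

```python
def final_distribution(p_gen, vocab_dist, attn_dist, src_tokens):
    """Pointer-generator mixture from the cited paper (illustrative sketch).

    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention over the
    source positions whose token equals w.
    """
    # Generation part: scale the vocabulary distribution by p_gen.
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Copy part: add the attention mass of each source position to its token,
    # which also lets out-of-vocabulary source tokens receive probability.
    for token, weight in zip(src_tokens, attn_dist):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final
```

With p_gen close to 1 the model mostly generates from the vocabulary; with p_gen close to 0 it mostly copies from the source.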
Thanks in advance. I wish you a happy and productive day.
It is either 1 or 2. If you are using copy attention, then it is (1), otherwise, it is (2). You can check here.
Yes, you can. You just need to update the code snippet as I referenced above.
Thanks a lot for the fast reply!
I checked and my model has copy_attn set to true, and I was extracting beam_attn, so I was getting the 'copy' attention, right?
I already performed the required modification in order to extract both attentions:
if self.copy_attn:
    out = copy_generator.forward(dec_out, attn["copy"], src_map)
    out = out.squeeze(1)
    # beam x batch_size x tgt_vocab
    out = unbottle(out.data)
    for b in range(out.size(0)):
        for bx in range(out.size(1)):
            if blank[bx]:
                blank_b = torch.Tensor(blank[bx]).to(code_word_rep)
                fill_b = torch.Tensor(fill[bx]).to(code_word_rep)
                out[b, bx].index_add_(0, fill_b,
                                      out[b, bx].index_select(0, blank_b))
                out[b, bx].index_fill_(0, blank_b, 1e-10)
    transformer_attention = unbottle(attn["std"].squeeze(1))  # CHANGE - CONSIDER THIS TRANSFORMER "REGULAR" ATTENTION (?)
    beam_attn = unbottle(attn["copy"].squeeze(1))  # CONSIDER THIS COPY ATTENTION (?)
else:
    out = generator.forward(dec_out.squeeze(1))
    # beam x batch_size x tgt_vocab
    out = unbottle(f.softmax(out, dim=1))
    # beam x batch_size x tgt_vocab
    transformer_attention = None
    beam_attn = unbottle(attn["std"].squeeze(1))
I would like to know whether I can now treat the "std" attention that I extract as the "regular" transformer attention, and the beam attention (which I was already extracting before) as the copy attention. Is that correct?
Thanks in advance.
According to my understanding, yes, you are right.
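To summarise the outcome of this thread, the selection logic can be condensed into a small helper. This is a sketch under the assumptions stated in the discussion: `select_attentions` is a hypothetical name, and `attn` is assumed to be a dictionary with a "std" entry and, when copy attention is enabled, a "copy" entry.

```python
def select_attentions(attn, copy_attn):
    """Return (regular_attention, beam_attention) as discussed in the thread.

    With copy attention enabled, "std" is treated as the regular attention
    and "copy" as the attention stored with the beam. Without it, only
    "std" exists and it is the one stored with the beam.
    """
    if copy_attn:
        return attn["std"], attn["copy"]
    return None, attn["std"]
```

For example, `select_attentions({"std": s, "copy": c}, True)` returns `(s, c)`, while `select_attentions({"std": s}, False)` returns `(None, s)`.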