Extract the attention weights
MattePalte opened this issue · 3 comments
Hi,
first of all, thanks to @wasiahmad for sharing the code and for the support provided.
I am trying to extract the attention weights from your model (just to observe them and store them in a file). At the moment I am able to get them in the from_batch function of the TranslationBuilder object in this way:
https://github.com/wasiahmad/NeuralCodeSum/blob/master/c2nl/translator/translation.py#L49
def from_batch(self, translation_batch, src_raw, targets, src_vocabs):
    batch_size = len(translation_batch["predictions"])
    preds = translation_batch["predictions"]
    pred_score = translation_batch["scores"]
    attn = translation_batch["attention"]
    translations = []
    attentions = []  # CHANGE
    for b in range(batch_size):
        src_vocab = src_vocabs[b] if src_vocabs else None
        pred_sents = [self._build_target_tokens(
            src_vocab, src_raw[b],
            preds[b][n], attn[b][n])
            for n in range(self.n_best)]
        translation = Translation(targets[b], pred_sents,
                                  attn[b], pred_score[b])
        translations.append(translation)
        attentions.append(attn[b])  # CHANGE
    return translations, attentions  # CHANGE
Then I propagate them back to the caller until I can manipulate them in the test.py file and write them to a file. If there is a better way, I would be grateful if you could share it.
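Concretely, on the test.py side I end up with something like the following (just a sketch; builder and the file name are my own choices, and torch.save is only one way to persist the tensors):

    import torch

    # from_batch now returns the attention tensors alongside the translations
    translations, attentions = builder.from_batch(
        translation_batch, src_raw, targets, src_vocabs)

    # attentions holds one entry per example in the batch;
    # torch.save handles nested lists of tensors
    torch.save(attentions, "attention_weights.pt")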
QUESTION A:
Reading the paper and looking at the arguments, it seems that copy attention is involved. I wanted to know whether the weights I extract are:
1. copy attention weights (used to copy tokens directly into the output)
2. attention weights used during prediction
3. a mix of the two. In this case I would be curious to know exactly how they are mixed, in particular with reference to the code, because the referenced work "Get To The Point: Summarization with Pointer-Generator Networks" (http://arxiv.org/abs/1704.04368) says: "We recycle the attention distribution to serve as the copy distribution". Is this the case for you too? (A sketch of that mixture follows below.)
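For reference, that recycling idea boils down to a two-way mixture. A minimal sketch of the pointer-generator combination from the See et al. paper (all names here are illustrative, not this repository's code):

    import torch

    def pointer_generator_mixture(vocab_dist, copy_attn, p_gen, src_ids):
        # vocab_dist: (batch, vocab)   softmax over the target vocabulary
        # copy_attn:  (batch, src_len) attention weights over source tokens
        # p_gen:      (batch, 1)       probability of generating vs. copying
        # src_ids:    (batch, src_len) LongTensor, vocabulary id of each source token
        final = p_gen * vocab_dist
        # the same attention distribution also serves as the copy distribution:
        # scatter its probability mass onto the source tokens' vocabulary ids
        final = final.scatter_add(1, src_ids, (1.0 - p_gen) * copy_attn)
        return final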
QUESTION B:
In case the extracted one is the copy attention, is it possible to somehow extract an attention that represents the self-attention of the Transformer architecture?
Thanks in advance. I wish you a happy and productive day,
Matteo
QUESTION A
It is either (1) or (2): if you are using copy attention, then it is (1); otherwise, it is (2). You can check here.
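(For example, one quick way to check which case applies is to inspect the flag stored with a saved model; this sketch assumes the checkpoint keeps the training arguments under an "args" key, and the path is just a placeholder:)

    import torch

    # load only the checkpoint metadata; map to CPU so no GPU is needed
    checkpoint = torch.load("path/to/model.mdl", map_location="cpu")
    print(checkpoint["args"].copy_attn)  # True -> case (1), False -> case (2)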
QUESTION B
Yes, you can. You just need to update the code snippet as I referenced above.
Thanks a lot for the fast reply!
I checked and my model has copy_attn set to true, and I was extracting beam_attn, so I was getting the copy attention, right?
I have already made the modification needed to extract both attentions:
if self.copy_attn:
    out = copy_generator.forward(dec_out, attn["copy"], src_map)
    out = out.squeeze(1)
    # beam x batch_size x tgt_vocab
    out = unbottle(out.data)
    for b in range(out.size(0)):
        for bx in range(out.size(1)):
            if blank[bx]:
                blank_b = torch.Tensor(blank[bx]).to(code_word_rep)
                fill_b = torch.Tensor(fill[bx]).to(code_word_rep)
                out[b, bx].index_add_(0, fill_b,
                                      out[b, bx].index_select(0, blank_b))
                out[b, bx].index_fill_(0, blank_b, 1e-10)
    transformer_attention = unbottle(attn["std"].squeeze(1))  # CHANGE - consider this the Transformer "regular" attention (?)
    beam_attn = unbottle(attn["copy"].squeeze(1))  # consider this the copy attention (?)
else:
    out = generator.forward(dec_out.squeeze(1))
    # beam x batch_size x tgt_vocab
    out = unbottle(f.softmax(out, dim=1))
    # beam x batch_size x tgt_vocab
    transformer_attention = None
    beam_attn = unbottle(attn["std"].squeeze(1))
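Since I accumulate these across many batches, I also detach them and move them to the CPU before storing them (all_attentions is just my own accumulator):

    # detach from the graph and move to CPU so the stored attention
    # maps do not keep GPU memory alive across batches
    stored_std = (transformer_attention.detach().cpu()
                  if transformer_attention is not None else None)
    stored_copy = beam_attn.detach().cpu()
    all_attentions.append({"std": stored_std, "copy": stored_copy})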
To confirm: can I now treat the "std" attention that I extract as the correct "regular" Transformer attention, and the beam attention (which I was already extracting before) as the copy attention?
Thanks in advance.
According to my understanding, yes, you are right.