Potential bug: Attention mask allows access to future tokens?
Closed this issue · 5 comments
Thanks for sharing your code, great work. I do not get one detail here, maybe you can help:
During the batch data generation, you fill the masks with ones wherever valid history trajectory data is available:
decision-transformer/gym/experiment.py
Line 154 in f04280e
Isn't it common practice in autoregressive models to use a triangular (causal) mask? In your training code, you consider all actions where the mask is > 0. Doesn't this allow actions early in the sequence to attend to subsequent tokens?
Thanks!
Thanks @donthomasitos, I think you're very right. This looks like a bug from my point of view. I hope we're wrong :)
See the documentation for attention mask here:
"attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) —
Mask to avoid performing attention on padding token indices.
Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked."
This is not the same as the triangular matrix used when computing the attention values.
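To make the distinction concrete, here is a minimal sketch (plain NumPy, not from the repo) of the two kinds of masks being discussed:

```python
import numpy as np

seq_len = 4

# Padding mask -- what HuggingFace's `attention_mask` argument means.
# Shape (batch, seq_len), 1 = real token, 0 = padding; here the last
# position of the single batch element is padding.
padding_mask = np.array([[1, 1, 1, 0]])

# Causal (triangular) mask -- applied inside the attention computation itself.
# Shape (seq_len, seq_len); position i may attend only to positions j <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(causal_mask)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

The padding mask only says which positions are real data; it carries no information about direction in time.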
Hello, sorry to dig up an old issue like this - but where exactly is this triangular attention mask applied? We traced the execution path from trainer.py
all the way to Attention.forward
and could not find any triangular attention masking.
You have to keep going deeper, into the forward of the attention module :)
The GPT-2 transformer in HuggingFace implements the causal triangular attention mask there; we don't implement it ourselves.
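For reference, here is a minimal NumPy sketch (hypothetical names, not the HuggingFace implementation) of how a causal mask built inside attention combines with a caller-supplied padding mask, and a quick check that the result is in fact causal:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(q, k, v, padding_mask):
    """Scaled dot-product attention; q, k, v: (batch, seq, dim),
    padding_mask: (batch, seq) with 1 = keep, 0 = padding."""
    seq = q.shape[1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    # Triangular mask constructed inside attention (analogous to GPT-2).
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    scores = np.where(causal, scores, -1e9)
    # Padding mask supplied by the caller (the `attention_mask` argument).
    scores = np.where(padding_mask[:, None, :] == 1, scores, -1e9)
    return softmax(scores) @ v

# Causality check: perturbing the last token leaves earlier outputs unchanged.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(1, 4, 8)) for _ in range(3))
mask = np.ones((1, 4))
out1 = causal_attention(q, k, v, mask)
k2, v2 = k.copy(), v.copy()
k2[:, -1] += 1.0
v2[:, -1] += 1.0
out2 = causal_attention(q, k2, v2, mask)
print(np.allclose(out1[:, :-1], out2[:, :-1]))  # True
```

So even with an all-ones `attention_mask`, positions before the last one cannot see the perturbed future token, because the triangular mask zeroes those attention weights.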