Potential bug: Attention mask allows access to future tokens?
Closed this issue · 5 comments
Thanks for sharing your code, great work. I do not get one detail here, maybe you can help:
During the batch data generation, you fill the masks with ones wherever valid history trajectory data is available:
decision-transformer/gym/experiment.py
Line 154 in f04280e
Isn't it common practice in autoregressive models to use a triangular (causal) mask? In your training code, you consider all actions where the mask is > 0. Doesn't this allow actions early in the sequence to attend to subsequent tokens?
Thanks!
Thanks @donthomasitos, I think you're very right. This looks like a bug from my point of view. I hope we're wrong :)
See the documentation for attention mask here:
"attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) —
Mask to avoid performing attention on padding token indices.
Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked."
This is not the same as the triangular matrix used when computing the attention values.
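To make the distinction concrete, here is a minimal sketch (plain NumPy, not from the repo) of the two kinds of masks being discussed:

```python
import numpy as np

seq_len = 4

# Padding mask -- what HuggingFace's `attention_mask` argument means.
# Shape (batch, seq_len), 1 = real token, 0 = padding; here the last
# position of the single batch element is padding.
padding_mask = np.array([[1, 1, 1, 0]])

# Causal (triangular) mask -- applied inside the attention computation itself.
# Shape (seq_len, seq_len); position i may attend only to positions j <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(causal_mask)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

The padding mask only says which positions are real data; it carries no information about direction in time.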
Hello, sorry to dig up an old issue like this - but where exactly is this triangular attention mask applied? We traced the execution path from trainer.py
all the way to Attention.forward
and could not find any triangular attention masking.
You have to keep going deeper, into the forward of the attention module :)
The GPT-2 transformer in HuggingFace implements the causal triangular attention mask there; we don't implement it ourselves.
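For reference, here is a minimal NumPy sketch (hypothetical names, not the HuggingFace implementation) of how a causal mask built inside attention combines with a caller-supplied padding mask, and a quick check that the result is in fact causal:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(q, k, v, padding_mask):
    """Scaled dot-product attention; q, k, v: (batch, seq, dim),
    padding_mask: (batch, seq) with 1 = keep, 0 = padding."""
    seq = q.shape[1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    # Triangular mask constructed inside attention (analogous to GPT-2).
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    scores = np.where(causal, scores, -1e9)
    # Padding mask supplied by the caller (the `attention_mask` argument).
    scores = np.where(padding_mask[:, None, :] == 1, scores, -1e9)
    return softmax(scores) @ v

# Causality check: perturbing the last token leaves earlier outputs unchanged.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(1, 4, 8)) for _ in range(3))
mask = np.ones((1, 4))
out1 = causal_attention(q, k, v, mask)
k2, v2 = k.copy(), v.copy()
k2[:, -1] += 1.0
v2[:, -1] += 1.0
out2 = causal_attention(q, k2, v2, mask)
print(np.allclose(out1[:, :-1], out2[:, :-1]))  # True
```

So even with an all-ones `attention_mask`, positions before the last one cannot see the perturbed future token, because the triangular mask zeroes those attention weights.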