kzl/decision-transformer

Potential bug: Attention mask allows access to future tokens?

Closed this issue · 5 comments

Thanks for sharing your code, great work. There is one detail I don't get here, maybe you can help:

During the batch data generation, you fill the masks with ones wherever valid history trajectory data is available:

mask.append(np.concatenate([np.zeros((1, max_len - tlen)), np.ones((1, tlen))], axis=1))

Isn't it common practice with autoregressive models to use a triangular matrix? In your training code, you consider all actions where the mask is > 0. Doesn't this allow actions early in the sequence to attend to subsequent tokens?

action_preds = action_preds.reshape(-1, act_dim)[attention_mask.reshape(-1) > 0]

Thanks!
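
For context, a minimal sketch (with hypothetical values for `max_len` and `tlen`, not code from the repository) of what this left-padding produces for a trajectory shorter than the context length:

```python
import numpy as np

max_len, tlen = 5, 3  # hypothetical: context length and actual trajectory length

# Left-pad with zeros so only the tlen real timesteps are marked as valid.
mask = np.concatenate([np.zeros((1, max_len - tlen)), np.ones((1, tlen))], axis=1)
print(mask)  # [[0. 0. 1. 1. 1.]] -- a padding mask, not a causal (triangular) mask
```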

Thanks @donthomasitos, I think you're very right. This looks like a bug from my point of view. I hope we're wrong :)

kzl commented

See the documentation for attention mask here:

"attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) —
Mask to avoid performing attention on padding token indices.
Mask values selected in [0, 1]:
1 for tokens that are not masked,
0 for tokens that are masked."

This is not the same as the triangular matrix used when computing the attention values.
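To make the distinction concrete, here is a minimal sketch (illustrative only, neither the repository's code nor the transformers source) of the two masks side by side:

```python
import torch

seq_len, valid = 5, 3  # hypothetical lengths

# Padding mask (what gets passed as `attention_mask`): marks real vs. padded tokens.
padding_mask = torch.cat([torch.zeros(seq_len - valid), torch.ones(valid)])
# tensor([0., 0., 1., 1., 1.])

# Causal mask (applied inside attention): position i may only attend to j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Conceptually the two constraints are combined: a query attends to a key only if
# the key is a real (non-padding) token AND does not lie in the future.
combined = causal_mask * padding_mask.unsqueeze(0)
```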

Hello, sorry to dig up an old issue like this, but where exactly is this triangular attention mask applied? We traced the execution path from trainer.py all the way to Attention.forward and could not find any triangular attention masking.

kzl commented

You have to keep going further inside the forward of the attention module :)

The GPT-2 transformer in Hugging Face implements the causal triangular attention mask there; we don't implement it ourselves.
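
The exact code depends on the transformers version, but conceptually the causal masking done inside GPT-2-style attention looks roughly like this sketch (illustrative, not the library's implementation):

```python
import torch
import torch.nn.functional as F

def causal_attention_scores(q, k, mask_value=-1e4):
    """Sketch of how GPT-2-style attention enforces causality internally;
    not the actual transformers source code."""
    # q, k: (batch, heads, seq_len, head_dim)
    attn_weights = q @ k.transpose(-1, -2) / (k.size(-1) ** 0.5)
    seq_len = q.size(-2)
    # Lower-triangular mask: query position i may only attend to key positions j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device))
    attn_weights = attn_weights.masked_fill(~causal, mask_value)
    return F.softmax(attn_weights, dim=-1)

# With the mask in place, the first query position cannot see later positions.
q = k = torch.randn(1, 1, 4, 8)
print(causal_attention_scores(q, k)[0, 0, 0])  # only index 0 gets non-negligible weight
```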