Self-Attention Mask Expansion Issue
fanoprcs opened this issue · 0 comments
fanoprcs commented
Assume a padding mask of [F, F, F, T, T], where T marks a padded position. In the encoder, this mask is expanded as follows:
slf_attn_mask = mask.unsqueeze(1).expand(-1, max_len, -1)
This results in the following mask:
[F, F, F, T, T]
[F, F, F, T, T]
[F, F, F, T, T]
[F, F, F, T, T]
[F, F, F, T, T]
The expanded mask is then passed to the scaled dot-product attention module. However, I don't think this is correct: the fourth and fifth positions are padding, so they should not be computing attention at all.
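A minimal sketch reproducing what I am describing (the tensor shapes and values here are my own assumptions, with `mask` of shape `(batch, max_len)` and True marking padding):

```python
import torch

# Assumed setup: boolean padding mask of shape (batch, max_len),
# True = padded position.
mask = torch.tensor([[False, False, False, True, True]])
max_len = mask.size(1)

# Current expansion: the padding mask is broadcast along the query axis,
# so every query row gets the same key mask.
slf_attn_mask = mask.unsqueeze(1).expand(-1, max_len, -1)
print(slf_attn_mask[0])
# tensor([[False, False, False,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False,  True,  True]])
```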
I think the correct version should be:
[F, F, F, T, T]
[F, F, F, T, T]
[F, F, F, T, T]
[T, T, T, T, T]
[T, T, T, T, T]
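A sketch of how I imagine this symmetric mask could be built, by OR-ing the key-side and query-side padding masks (variable names here are mine, not from the repo):

```python
import torch

mask = torch.tensor([[False, False, False, True, True]])
max_len = mask.size(1)

key_mask = mask.unsqueeze(1).expand(-1, max_len, -1)    # masks padded keys (columns)
query_mask = mask.unsqueeze(2).expand(-1, -1, max_len)  # masks padded queries (rows)
slf_attn_mask = key_mask | query_mask
print(slf_attn_mask[0])
# tensor([[False, False, False,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False,  True,  True],
#         [ True,  True,  True,  True,  True],
#         [ True,  True,  True,  True,  True]])
```

One caveat I am unsure about: if those fully-True rows are later filled with -inf before the softmax, they would produce NaNs, so the padded query rows would presumably still need separate handling somewhere.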
Could someone clarify whether this is an actual issue or a misunderstanding on my part?