phlippe/uvadlc_notebooks

A question about Tutorial 6 - Transformers

gihanpanapitiya opened this issue · 2 comments

Hello, I am trying to implement your Tutorial 6 for sequences with different lengths. I am a little confused about getting non-zero values at the padded positions (where we don't have any 'non-padding' tokens) in the 'values' matrix.

This is the example I use:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

num_heads = 2
seq_length = 4
batch_size = 2
embed_dim = 2  # inferred from the printed output shape below
head_dim = embed_dim // num_heads

embed = torch.nn.Embedding(5, embed_dim, padding_idx=0)
s = torch.tensor([[1, 2, 0, 0], [1, 2, 3, 4]])  # here 0 is the padding token
e = embed(s)

qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
qkv = qkv_proj(e)
qkv = qkv.reshape(batch_size, seq_length, num_heads, 3 * head_dim)
qkv = qkv.permute(0, 2, 1, 3)  # [batch_size, num_heads, seq_length, 3*head_dim]
q, k, v = qkv.chunk(3, dim=-1)

d_k = q.size()[-1]
attn_logits = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # scaled dot product

Now, regarding the mask, I created it in two ways. First method:

mask = torch.tensor([[0, 0, 1, 1], [0, 0, 0, 0]], dtype=torch.bool)  # True = padded position to mask out
mask = mask.unsqueeze(1).unsqueeze(2)  # [batch_size, 1, 1, seq_length], broadcasts over heads and query positions
attn_logits = attn_logits.masked_fill(mask, float('-inf'))

attention = F.softmax(attn_logits, dim=-1)
values = torch.matmul(attention, v)

values = values.permute(0, 2, 1, 3) # [Batch, SeqLen, Head, Dims]
values = values.reshape(batch_size, seq_length, embed_dim)

With the above, I get the following values matrix:

tensor([[[-0.4799, -0.1019],
         [-0.4799, -0.0683],
         [-0.4800, -0.0904],
         [-0.4800, -0.0904]],

        [[-0.5304,  0.0575],
         [-0.5303,  0.1040],
         [-0.5277,  0.1278],
         [-0.5298,  0.0873]]], grad_fn=<ReshapeAliasBackward0>)

So we get non-zero values for positions where we don't have actual tokens. That is, we get [-0.4800, -0.0904] for the positions that held the padding token zero. Is this OK?


Now, I can also define the mask as follows:

mask = torch.tensor([
    [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]],
    [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
], dtype=torch.bool)

mask = expand_mask(mask)  # I used your expand_mask function
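For reference, my understanding is that expand_mask only adds the missing batch/head dimensions so the mask broadcasts to [batch_size, num_heads, seq_length, seq_length]; roughly (paraphrased from the tutorial, not copied verbatim):

def expand_mask(mask):
    # Broadcast a 2D/3D mask up to 4D [batch_size, num_heads, seq_len_q, seq_len_k]
    assert mask.ndim >= 2, "Mask must be at least 2-dimensional"
    if mask.ndim == 3:
        mask = mask.unsqueeze(1)  # add the head dimension
    while mask.ndim < 4:
        mask = mask.unsqueeze(0)  # add the batch dimension
    return mask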

If I use this mask instead, the values matrix I get looks like this:

tensor([[[-0.4799, -0.1019],
         [-0.4799, -0.0683],
         [    nan,     nan],
         [    nan,     nan]],

        [[-0.5304,  0.0575],
         [-0.5303,  0.1040],
         [-0.5277,  0.1278],
         [-0.5298,  0.0873]]], grad_fn=<ReshapeAliasBackward0>)

So here I get NaN for the padded positions. We could replace these NaNs with zeros.
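For example, something like this (a minimal sketch, reusing s and values from my code above; not from the tutorial):

# Replace the NaNs produced by the fully masked rows with zeros
values = torch.nan_to_num(values, nan=0.0)

# Or zero the padded positions explicitly from the padding mask:
# pad = (s == 0)                                    # [batch_size, seq_length], True at padding tokens
# values = values.masked_fill(pad.unsqueeze(-1), 0.0)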

What do you think is the correct approach?

Hi, a padding token is generally masked out so that the unmasked tokens are not influenced by it. However, since these are tokens we do not need and they do not influence the content tokens, we usually don't care about their values.
In your first example, you restrict all tokens in the first sequence of the batch to only attend to the first two positions. This includes the padding tokens, so their new values depend on the first two tokens and are therefore not zero. In the second case, you mask out everything for the padding tokens, so all logits in those rows are -inf and the softmax divides by zero, which produces the NaNs.
Generally, which of the two options you choose is up to you, since they make no difference for the content tokens. If you prefer a clearer indication of where the padding tokens are throughout the network, you can go with option 2 and prevent the NaNs (e.g. by zeroing them out). Otherwise, option 1 is the simpler version to implement.
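If you go with option 2, you also don't need to write the square mask out by hand; it can be built from the padding positions. A minimal sketch, assuming s, attn_logits, v, F and expand_mask from your snippets above, with True meaning "mask out":

pad = (s == 0)                                     # [batch_size, seq_length], True at padding tokens
keep = ~pad
# Position (i, j) is kept only if both query i and key j are real tokens
mask = ~(keep.unsqueeze(2) & keep.unsqueeze(1))    # [batch_size, seq_length, seq_length]
mask = expand_mask(mask)                           # -> [batch_size, 1, seq_length, seq_length]

attn_logits = attn_logits.masked_fill(mask, float('-inf'))
attention = F.softmax(attn_logits, dim=-1)
values = torch.matmul(attention, v)
values = torch.nan_to_num(values, nan=0.0)         # padded rows become zeros instead of NaN
# then permute/reshape as before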

Thank you! And thank you for maintaining these tutorials! They are very detailed and really useful.