A question about Tutorial 6 - Transformers
gihanpanapitiya opened this issue · 2 comments
Hello, I am trying to implement your Tutorial 6 for sequences with different lengths. I am a little confused about getting non-zero values for the padded positions (where we don't have any 'non-padding' tokens) in the 'value' matrix.
This is the example I use:
import torch
import torch.nn as nn
import torch.nn.functional as F

num_heads = 2
seq_length = 4
batch_size = 2
embed_dim = 2  # inferred from the output below, where the last dimension is 2
head_dim = embed_dim // num_heads
embed = torch.nn.Embedding(5, embed_dim, padding_idx=0)
s = torch.tensor([[1, 2, 0, 0], [1, 2, 3, 4]])  # here 0 is the padding token
e = embed(s)
qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
qkv = qkv_proj(e)
qkv = qkv.reshape(batch_size, seq_length, num_heads, 3 * head_dim)
qkv = qkv.permute(0, 2, 1, 3)  # [batch_size, num_heads, seq_length, 3*head_dim]
q, k, v = qkv.chunk(3, dim=-1)
d_k = q.size()[-1]
attn_logits = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)  # scaled dot product (d_k = 1 here, so the numbers are unchanged)
Now, regarding the mask, I created it in two ways. First method:
mask = torch.tensor([[0, 0, 1, 1], [0, 0, 0, 0]], dtype=torch.bool)  # True marks the padded positions
mask = mask.unsqueeze(1).unsqueeze(2)  # [batch_size, 1, 1, seq_length]
attn_logits = attn_logits.masked_fill(mask, float('-inf'))
attention = F.softmax(attn_logits, dim=-1)
values = torch.matmul(attention, v)
values = values.permute(0, 2, 1, 3)  # [Batch, SeqLen, Head, Dims]
values = values.reshape(batch_size, seq_length, embed_dim)
With the above, I get the following for the values matrix:
tensor([[[-0.4799, -0.1019],
[-0.4799, -0.0683],
[-0.4800, -0.0904],
[-0.4800, -0.0904]],
[[-0.5304, 0.0575],
[-0.5303, 0.1040],
[-0.5277, 0.1278],
[-0.5298, 0.0873]]], grad_fn=<ReshapeAliasBackward0>)
So we have got non-zero values for positions where we don't have actual tokens. That is, we got [-0.4800, -0.0904] for the positions where we had the padding token zero. Is this OK?
Now I can also define the mask as follows:
mask = torch.tensor([
    [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]],
    [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
], dtype=torch.bool)
mask = expand_mask(mask) # I used your expand_mask function
If I use the above mask, the values matrix I get looks like this:
tensor([[[-0.4799, -0.1019],
[-0.4799, -0.0683],
[ nan, nan],
[ nan, nan]],
[[-0.5304, 0.0575],
[-0.5303, 0.1040],
[-0.5277, 0.1278],
[-0.5298, 0.0873]]], grad_fn=<ReshapeAliasBackward0>)
So here I got NaN for the positions where we have padding. We could replace these NaNs with zeros.
What do you think is the correct approach?
Hi, a padding token is generally masked such that the unmasked tokens are not influenced by it. However, since these are tokens we do not need and they do not influence the content tokens, we usually don't care about their values.
In your first example, you restrict all tokens in the first sequence of the batch to only look at the first two positions. This includes the padding tokens, meaning that their new values depend on the first two tokens and thus are not zero. In the second case, you mask out everything for the padding tokens, which results in a division by zero (0/0) in the softmax and hence the NaNs.
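To see where the NaNs come from, here is a minimal sketch (not from the tutorial) of a query row whose logits are all masked:

import torch
import torch.nn.functional as F

row = torch.full((4,), float('-inf'))  # every key masked for this query position
print(F.softmax(row, dim=-1))  # tensor([nan, nan, nan, nan]): exp(-inf) / sum(exp(-inf)) = 0 / 0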
Generally, which of the two options you choose is up to you, since they make no difference for the content tokens. If you prefer a clearer indication of where the padding tokens are throughout the network, you can go with option 2 and prevent the NaNs. Otherwise, option 1 is the simpler version to implement.
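If you want the padded positions to come out as clean zeros without ever producing NaNs, one possible workaround is to mask only the padded keys in the softmax (option 1) and then zero out the outputs at the padded query positions afterwards. This is only a sketch, not the tutorial's code; the function name and the True-means-padding convention of key_padding_mask are my assumptions:

import torch
import torch.nn.functional as F

def attention_with_padding(q, k, v, key_padding_mask):
    # q, k, v: [batch, heads, seq_len, head_dim]
    # key_padding_mask: [batch, seq_len], True where the token is padding (assumed convention)
    d_k = q.size(-1)
    attn_logits = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)
    # Mask padded keys only, so no row is entirely -inf and the softmax stays NaN-free
    attn_logits = attn_logits.masked_fill(key_padding_mask[:, None, None, :], float('-inf'))
    attention = F.softmax(attn_logits, dim=-1)
    values = torch.matmul(attention, v)
    # Zero out the rows belonging to padded query positions, giving the "zeros at padding" look of option 2
    return values.masked_fill(key_padding_mask[:, None, :, None], 0.0)

The content positions are identical either way; only the rows belonging to padding tokens change.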
Thank you! And thank you for maintaining these tutorials! They are very detailed and really useful.