Why permute NLD to LND shape?
Closed this issue · 2 comments
tau-yihouxiang commented
Why is x's shape permuted from NLD to LND? This differs from the usual convention, and it looks as if attention is being computed over the batch dimension for each token.
```python
x = x.permute(1, 0, 2)  # NLD -> LND
for i in range(self.num_layers):
    x = self.transformer[i](x)
x = x.permute(1, 0, 2)  # LND -> NLD
```
MaxxP0 commented
Because they don't use nn.MultiheadAttention with batch_first=True. The default (batch_first=False) expects input of shape (L, N, D), i.e., sequence length first, so the code permutes to LND before the transformer layers and back to NLD afterwards.
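For context, a minimal sketch of the two conventions, assuming the transformer layers wrap PyTorch's nn.MultiheadAttention and using made-up shapes (N=2, L=5, D=8):

```python
import torch
import torch.nn as nn

# Hypothetical shapes for illustration: batch N=2, sequence length L=5, embed dim D=8.
N, L, D = 2, 5, 8
x = torch.randn(N, L, D)  # NLD layout

# Default nn.MultiheadAttention (batch_first=False) expects (L, N, D),
# so the input is permuted first and permuted back afterwards.
attn = nn.MultiheadAttention(embed_dim=D, num_heads=2)
x_lnd = x.permute(1, 0, 2)           # NLD -> LND
out, _ = attn(x_lnd, x_lnd, x_lnd)   # self-attention over the L axis
out = out.permute(1, 0, 2)           # LND -> NLD

# With batch_first=True (available since PyTorch 1.9) no permutes are needed.
attn_bf = nn.MultiheadAttention(embed_dim=D, num_heads=2, batch_first=True)
out_bf, _ = attn_bf(x, x, x)         # input and output stay NLD

print(out.shape, out_bf.shape)  # torch.Size([2, 5, 8]) torch.Size([2, 5, 8])
```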
tau-yihouxiang commented
Thanks~