bytedance/1d-tokenizer

Why permute NLD to LND shape?

Closed this issue · 2 comments

Why is x permuted from NLD to LND? This seems to differ from the usual principle; it looks as though attention would be computed over the batch dimension for each token.

x = x.permute(1, 0, 2)  # NLD -> LND
for i in range(self.num_layers):
    x = self.transformer[i](x)
x = x.permute(1, 0, 2)  # LND -> NLD

Because they don't use nn.MultiheadAttention with batch_first=True. PyTorch's nn.MultiheadAttention defaults to batch_first=False, which expects input of shape (L, N, D), i.e. (seq_len, batch, embed_dim), so the tensor is permuted to LND before the transformer layers and back to NLD afterwards. Attention is still computed over the sequence dimension within each sample; only the tensor layout changes.
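
A minimal sketch (not from this repo) illustrating the point: with identical weights, the default batch_first=False layer fed a permuted LND tensor produces the same result as a batch_first=True layer fed the original NLD tensor. The shapes N, L, D and the two layer instances here are just illustrative.

import torch
import torch.nn as nn

torch.manual_seed(0)
N, L, D, heads = 2, 5, 16, 4          # batch, sequence length, embed dim, num heads

x = torch.randn(N, L, D)              # NLD: (batch, length, dim)

attn_lnd = nn.MultiheadAttention(D, heads)                   # batch_first=False (default)
attn_nld = nn.MultiheadAttention(D, heads, batch_first=True)
attn_nld.load_state_dict(attn_lnd.state_dict())              # same weights for comparison

# Default layout: permute NLD -> LND, attend, permute back.
x_lnd = x.permute(1, 0, 2)
y_lnd, _ = attn_lnd(x_lnd, x_lnd, x_lnd)
y_lnd = y_lnd.permute(1, 0, 2)        # LND -> NLD

# batch_first=True layout: no permutes needed.
y_nld, _ = attn_nld(x, x, x)

print(torch.allclose(y_lnd, y_nld, atol=1e-6))  # True: same attention, different layout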

Thanks~