bytedance/1d-tokenizer

Why permute NLD to LND shape?

Closed this issue · 2 comments

Why is x permuted from NLD to LND? This seems to differ from the usual principle; it looks as though attention would be computed over the batch dimension for each token.

x = x.permute(1, 0, 2)  # NLD -> LND
for i in range(self.num_layers):
    x = self.transformer[i](x)
x = x.permute(1, 0, 2)  # LND -> NLD

Because they don't use nn.MultiheadAttention with batch_first=True. PyTorch's nn.MultiheadAttention defaults to batch_first=False, which expects input of shape (L, N, D), i.e. (seq_len, batch, embed_dim), so the tensor is permuted to LND before the transformer layers and back to NLD afterwards. Attention is still computed over the sequence dimension within each sample; only the tensor layout changes.
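
A minimal sketch (not from this repo) illustrating the point: with identical weights, the default batch_first=False layer fed a permuted LND tensor produces the same result as a batch_first=True layer fed the original NLD tensor. The shapes N, L, D and the two layer instances here are just illustrative.

import torch
import torch.nn as nn

torch.manual_seed(0)
N, L, D, heads = 2, 5, 16, 4          # batch, sequence length, embed dim, num heads

x = torch.randn(N, L, D)              # NLD: (batch, length, dim)

attn_lnd = nn.MultiheadAttention(D, heads)                   # batch_first=False (default)
attn_nld = nn.MultiheadAttention(D, heads, batch_first=True)
attn_nld.load_state_dict(attn_lnd.state_dict())              # same weights for comparison

# Default layout: permute NLD -> LND, attend, permute back.
x_lnd = x.permute(1, 0, 2)
y_lnd, _ = attn_lnd(x_lnd, x_lnd, x_lnd)
y_lnd = y_lnd.permute(1, 0, 2)        # LND -> NLD

# batch_first=True layout: no permutes needed.
y_nld, _ = attn_nld(x, x, x)

print(torch.allclose(y_lnd, y_nld, atol=1e-6))  # True: same attention, different layout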

Thanks~