In iTransformer, is the time series projected and split into heads along the variate dimension or along the time dimension?
Great paper and a simple idea that actually turned out to be great!
But I'm trying to understand one thing: in self-attention, do you project the time series along the time dimension and split it into heads along the time dimension as well? Like here?
Then do you perform self-attention along time patch i for each head?
And then, for a single head, attention along variates for a given time patch?
Yes, we use multi-head attention. However, the time representations that are split into heads have already been mixed by the initial embedding, so the heads do not explicitly preserve the order of time patches.
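To make this concrete, here is a minimal sketch of the inverted setup, assuming PyTorch; the layer names and sizes (`embed`, `d_model`, `n_heads`, etc.) are illustrative and not taken from the repo's exact code. Each variate's whole series becomes one token, and the heads split the channel dimension of that token, not the time axis:

```python
import torch
import torch.nn as nn

# Illustrative sizes (not the paper's exact configuration).
B, T, N, d_model, n_heads = 32, 96, 7, 512, 8

x = torch.randn(B, T, N)            # (batch, time, variates)

# Inverted embedding: each variate's full series of length T becomes one token.
# The Linear mixes all T time steps into d_model channels, so temporal order
# is absorbed here, before attention ever runs.
embed = nn.Linear(T, d_model)
tokens = embed(x.transpose(1, 2))   # (B, N, d_model): one token per variate

# Multi-head self-attention over the N variate tokens. Heads split the
# d_model channel dimension: each head sees d_model // n_heads channels
# of every variate token, never a slice of the time axis.
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
out, weights = attn(tokens, tokens, tokens)

print(out.shape)      # torch.Size([32, 7, 512])
print(weights.shape)  # torch.Size([32, 7, 7]) -- a variate-by-variate map
```

So the attention map is variate-by-variate, and within each head the channels are mixtures of all time steps produced by the embedding, which is why no head corresponds to a particular time patch.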