Note that $L_T \cdot X$ for any vector $X$ is essentially equivalent to $\text{cumsum} = (\sum\limits_{k=0}^{0} X_k, \sum\limits_{k=0}^{1} X_k, ..., \sum\limits_{k=0}^{T-1} X_k)^T$. Moreover, a product $U_T \cdot X$ can be expressed as a "reversed cumsum", i.e. $\text{rev cumsum} = (\sum\limits_{k=0}^{T-1} X_k, \sum\limits_{k=1}^{T-1} X_k, ..., \sum\limits_{k=T-1}^{T-1} X_k)^T$
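As a sanity check, both identities can be verified directly in PyTorch, assuming $L_T$ and $U_T$ are the all-ones lower and upper triangular binary matrices:

```python
import torch

T = 5
X = torch.randn(T)
L_T = torch.tril(torch.ones(T, T))  # binary lower triangular matrix
U_T = torch.triu(torch.ones(T, T))  # binary upper triangular matrix

# L_T @ X is the cumulative sum of X
cumsum = L_T @ X
# U_T @ X is a "reversed cumsum": flip X, cumsum, flip back
rev_cumsum = U_T @ X
```

Both products therefore cost $O(T)$ when implemented as a CumSum instead of a dense matrix multiply.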
3.1 Reformulation
Knowing that $Y_t = A_{t - 1} * Y_{t - 1} + X_t$, we can substitute $Y_{t - 1}$ recursively and get $Y_t = X_t + \sum\limits_{j=0}^{t-1} \left( \prod\limits_{k=j}^{t-1} A_k \right) X_j$, i.e. $Y = X + M \cdot X$ for a lower triangular matrix $M$ with $M_{t,j} = \prod\limits_{k=j}^{t-1} A_k$.
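For reference, the recurrence itself can be evaluated with a naive sequential loop (a single-sequence sketch; `linear_scan_reference` is an illustrative helper, and we assume the recurrence starts with $Y_0 = X_0$):

```python
import torch

def linear_scan_reference(A, X):
    # Naive O(T) sequential evaluation of Y_t = A_{t-1} * Y_{t-1} + X_t,
    # assuming Y_0 = X_0.
    Y = torch.empty_like(X)
    Y[0] = X[0]
    for t in range(1, len(X)):
        Y[t] = A[t - 1] * Y[t - 1] + X[t]
    return Y
```

This loop is the ground truth the parallel reformulation below must reproduce.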
Unfortunately, the product torch.bmm(M, X) requires $O(T^2)$ operations to compute, so we have to calculate it in a more efficient way. Let's first consider the matrix $Z$.
3.2 $Z$ matrix
First, note that the first row of $M$ consists of only $0$s, so let's consider a matrix $Z$ such that:
$Z$ can be expressed more compactly in matrix notation, using the binary lower triangular matrix $L_T$ introduced earlier and a matrix $\overline{Z}$, as follows:
By definition, $Z_{i,j} = A_i * A_{i-1} * ... * A_{j}$, so $Z_{i, j} = \exp(\sum\limits_{k=j}^i \ln(A_k))$. Note that since the calculations can be performed over complex numbers, the fact that $A$ might have negative values is not a problem: we just need to compute torch.log in the complex plane. Indeed, $|x| \cdot e^{i \phi} = |x| \cos(\phi) + i |x| \sin(\phi)$, which equals $-|x|$ at $\phi = \pi$, so $\log(-|x|) = \log(|x|) + i \pi$.
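For instance (a minimal sketch), casting to a complex dtype lets torch.log handle negative entries, and a product of several $A_k$ is recovered as the exponential of the summed complex logs:

```python
import torch

A = torch.tensor([0.5, -0.8, 1.2, -0.3])

# log over complex numbers: log(-|x|) = log|x| + i*pi
log_A = torch.log(A.to(torch.complex64))

# the product A_0 * A_1 * A_2 * A_3 as exp of the sum of complex logs;
# the imaginary part vanishes (up to rounding) because the result is real
prod = torch.exp(log_A.sum())
```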
Now let's express $\overline{Z}$ as a matrix-vector product, $\forall n$:
It turns out this product can be computed efficiently using either the FFT or a CumSum. We can extend the matrix $D[n, :, :]$ to make it circulant and, leveraging the FFT, multiply a circulant matrix by a vector in just $O(T \log (T))$ operations; there is an excellent paper about circulant matrices. Another approach is to leverage a distributed CumSum. However, even though $Z$ can be computed efficiently, it still requires $O(T^2)$ memory per sequence to store the tensor and $O(T^2)$ operations to compute the product of this tensor with $X$. So in the next section we will get rid of torch.bmm(M, X) entirely.
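The FFT route rests on the convolution theorem: a circulant matrix is diagonalized by the DFT, so multiplying it by a vector reduces to elementwise multiplication in frequency space. A minimal sketch (circulant_matvec is an illustrative helper, not part of the repository):

```python
import torch

def circulant_matvec(c, x):
    # Multiply the circulant matrix whose first column is c by the vector x.
    # By the convolution theorem, C @ x = ifft(fft(c) * fft(x)),
    # i.e. the circular convolution of c and x, in O(T log T).
    return torch.fft.ifft(torch.fft.fft(c) * torch.fft.fft(x)).real
```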
As we said earlier, the product of the matrix $U_T$ with any vector can be computed efficiently using either the FFT or a CumSum. If we denote A_log.cumsum(dim=-1) as CS, then Z[n, T - 1, :] = CS[:, -1].unsqueeze(-1) + A_log - CS. Now let's note the following fact: for any $t$,
Z[n, t - 1, :] = (Z[n, t, :] - ln(A[n, t])) * torch.cat([torch.ones(t), torch.zeros(T - t)])
This leads to the conclusion that, knowing $Z[n, T - 1, :]$, we can easily calculate $Z[n, t, :]$ for any $t$ by gradually subtracting $\ln(A[n, k])$ and applying a proper mask of the form torch.cat([torch.ones(t), torch.zeros(T - t)]). Let's now rewrite this in tensor form, and denote $Z[n, T - 1, :]$ as $V$.
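The closed form for the last row can be checked directly against the definition $Z[n, T - 1, j] = \sum_{k=j}^{T-1} \ln(A[n, k])$ (a sketch for a single sequence $n$):

```python
import torch

T = 6
A_log = torch.randn(T)   # plays the role of ln(A[n, :]) for one sequence n
CS = A_log.cumsum(dim=-1)

# last row of Z: Z[T - 1, j] = sum_{k=j}^{T-1} A_log[k]
# = (total sum) - (sum up to j) + A_log[j]
Z_last = CS[-1] + A_log - CS
```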
$$
Y = X + \exp(Z) \cdot X = X + (L_T * \exp(\overline{Z})) \cdot X
$$
Moreover, we know that we can compute the product of $U_T$ with any vector efficiently, so both $V$ and $W$ can be computed in a fast way.
3.4.2 $V$ and $W$
We know that both $V$ and $W$ can be calculated efficiently; the only thing remaining is to compute $Y$ efficiently. Knowing that both $V$ and $W$ have shape $N \times T$, we can rewrite the equation as follows:
Y = X + (L_T * torch.exp(V.unsqueeze(2) + W.unsqueeze(1))) @ X
The final step is to use the following trick. For any vector $a$ and matrix $B$: $a^T * B = \text{diag}(a) \cdot B$ (the elementwise product broadcasts $a$ over rows) and $B * a = B \cdot \text{diag}(a)$ (broadcast over columns), so we get a very elegant formula for $Y$:
$$
Y = X + \text{diag}(e^{V}) \cdot L_T \cdot \text{diag}(e^{W}) \cdot X = X + (e^{V})^T * (L_T \cdot (e^{W} * X))
$$
Or using the CumSum operation:
$$
Y = X + (e^{V})^T * (\text{cumsum}(e^{W} * X, \text{dim}=1))
$$
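The equivalence between the dense diag form and the CumSum form can be verified numerically for arbitrary $V$, $W$, and $X$ (a single-sequence sketch, so the cumsum runs over dim=0 rather than dim=1):

```python
import torch

T = 6
V, W, X = torch.randn(T), torch.randn(T), torch.randn(T)
L_T = torch.tril(torch.ones(T, T))  # binary lower triangular matrix

# dense form: Y = X + diag(e^V) @ L_T @ diag(e^W) @ X
Y_dense = X + torch.diag(V.exp()) @ L_T @ torch.diag(W.exp()) @ X

# cumsum form: Y = X + e^V * cumsum(e^W * X), avoiding any T x T matrix
Y_fast = X + V.exp() * torch.cumsum(W.exp() * X, dim=0)
```

The cumsum form needs only $O(T)$ memory and work per sequence, versus $O(T^2)$ for the dense form.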
$L_T$, as well as $U_T$, can be multiplied by a vector efficiently, so in this formulation, by varying the algorithm used to compute the CumSum, we can achieve different performance characteristics. In the next few sections we explain different CumSum algorithms, their tradeoffs, and their complexities.
It turns out we can extend the matrices $U_T$ and $L_T$ to make them circulant and, leveraging the FFT, multiply a circulant matrix by a vector in just $O(T \log (T))$ operations; there is an excellent paper about circulant matrices. See more details in FFT.md
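For example, $L_T$ embeds into a circulant matrix of size $2T$ whose first column is $T$ ones followed by $T$ zeros: zero-pad $X$ to length $2T$, multiply via the FFT, and keep the first $T$ entries. A sketch of this idea (cumsum_via_fft is an illustrative helper):

```python
import torch

def cumsum_via_fft(x):
    # Embed L_T into a 2T x 2T circulant matrix: its first column is
    # T ones followed by T zeros, so its top-left T x T block is L_T.
    T = len(x)
    c = torch.cat([torch.ones(T), torch.zeros(T)])
    x_pad = torch.cat([x, torch.zeros(T)])
    # circulant matvec via the FFT: C @ x_pad = ifft(fft(c) * fft(x_pad))
    y = torch.fft.ifft(torch.fft.fft(c) * torch.fft.fft(x_pad)).real
    return y[:T]  # equals L_T @ x, i.e. the cumulative sum of x
```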
4. Code
To run the code, PyTorch and tqdm are required. The recommended way is to use Docker; the image can be built by running:
docker build -t scan .
5. Benchmarking
Here we compare our CumSum implementation with the one proposed by François Fleuret. Benchmarking was conducted on an NVIDIA V100; see the Dockerfile for environment details and benchmark.py for the benchmark code.