Differentiable reparameterization of matrices with orthogonal columns.
This is a PyTorch implementation of two ways of achieving differentiable reparameterization of matrices with orthogonal columns.
I developed this small side project while trying to implement Orthogonal and Householder Sylvester flows.
"DIOR" (cool name, right?) stands for "DIfferentiable ORthogonalization".
Method 1: iterative
The iterative procedure:
$$Q^{(k+1)} = Q^{(k)} (I + \frac{1}{2} (I - Q^{(k) T} Q^{(k)}))$$
where $Q^{(0)}$ is an arbitrary matrix with unit Frobenius norm. $Q^{(k)}$ approaches a matrix with orthonormal columns as $k \rightarrow \infty$, but the iteration can be terminated early once $Q^{(k) T} Q^{(k)}$ is close enough to the identity.
Let $D$ and $M$ denote the number of rows and columns of $Q^{(0)}$ respectively. This method is intended for tall matrices ($D > M$) with orthogonal columns. It can also work for $D = M$, but, empirically, convergence is not guaranteed in that case.
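A minimal PyTorch sketch of this iteration (the function name `orthogonalize` and the step count are my choices, not this repo's API):

```python
import torch

def orthogonalize(Q0, num_steps=50):
    """Iteratively push Q0 toward a matrix with orthonormal columns.

    Implements Q <- Q (I + 0.5 (I - Q^T Q)); every step is differentiable,
    so gradients can flow back to the entries of Q0.
    """
    _, M = Q0.shape
    I = torch.eye(M, dtype=Q0.dtype)
    Q = Q0
    for _ in range(num_steps):
        Q = Q @ (I + 0.5 * (I - Q.T @ Q))
    return Q

torch.manual_seed(0)
# a tall 3-by-2 starting matrix, scaled to unit Frobenius norm as required
Q0 = torch.randn(3, 2, dtype=torch.float64)
Q0 = Q0 / Q0.norm()
Q = orthogonalize(Q0)
print(torch.allclose(Q.T @ Q, torch.eye(2, dtype=torch.float64), atol=1e-8))
```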
Experiment 1: reconstructing a 3-by-2 matrix with orthogonal columns
Method 2: Householder
A Householder transformation reflects vectors about the hyperplane orthogonal to a vector $\mathbf{v}$:
$$H(\mathbf{v}) = I - 2 \frac{\mathbf{v} \mathbf{v}^T}{\mathbf{v}^T \mathbf{v}}$$
where $\mathbf{v}$ is the parameter of the transformation.
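For reference, a single Householder reflection $H(\mathbf{v}) = I - 2\,\mathbf{v}\mathbf{v}^T / (\mathbf{v}^T\mathbf{v})$ takes a few lines of PyTorch; this is the standard construction, not code taken from the notebook:

```python
import torch

def householder(v):
    """Householder reflection H(v) = I - 2 v v^T / (v^T v).

    H(v) is orthogonal for any nonzero v and differentiable in v,
    so v can be trained directly by gradient descent.
    """
    I = torch.eye(v.shape[0], dtype=v.dtype)
    return I - 2.0 * torch.outer(v, v) / (v @ v)

torch.manual_seed(0)
H = householder(torch.randn(4, dtype=torch.float64))
print(torch.allclose(H.T @ H, torch.eye(4, dtype=torch.float64)))  # orthogonal
```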
In the Sylvester flows paper, there was this quote: "It can be shown that any $M \times M$ orthogonal matrix can be written as the product of $M-1$ Householder transformations." Unfortunately, I think it is wrong.
@article{uhlig2001constructive,
title={Constructive ways for generating (generalized) real orthogonal matrices as products of (generalized) symmetries},
author={Uhlig, Frank},
journal={Linear Algebra and its Applications},
volume={332},
pages={459--467},
year={2001},
publisher={Elsevier}
}
The paper above contains a very important theorem on decomposing orthogonal matrices into Householder transformations.
Theorem 2. Every real orthogonal $n \times n$ matrix $U$ is the product of $n - m$ real orthogonal Householder matrices for $m = \dim(\ker(U - I_n))$.
Clearly, $m$ can be anywhere between $0$ and $n$ inclusive ($m = n$ exactly when $U = I_n$). If $m = 0$, then $U$ requires $n$ Householder transformations, not $n-1$. Sure, the Sylvester flows paper was off by one, but what's the big deal?
Let's assume $n - m < n - 1$. Such a $U$ requires fewer than $n-1$ Householder transformations according to the theorem. Sure, but what would be the harm of using $n-1$ transformations (more than required)? Every Householder matrix has determinant $-1$, so a product of an odd number of them has determinant $-1$ and can never equal the identity matrix. Consequently, padding a product of $n-m$ transformations up to $n-1$ only works when the difference $(n-1) - (n-m) = m-1$ is even.
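This parity argument is easy to check numerically. In the sketch below (standard constructions, not code from the notebook), the product of an odd number of random reflections always has determinant $-1$ and therefore cannot be the identity:

```python
import torch

def householder(v):
    # standard Householder reflection I - 2 v v^T / (v^T v); det is -1
    I = torch.eye(v.shape[0], dtype=v.dtype)
    return I - 2.0 * torch.outer(v, v) / (v @ v)

torch.manual_seed(0)
n = 8
P = torch.eye(n, dtype=torch.float64)
for _ in range(5):  # an odd number of reflections
    P = P @ householder(torch.randn(n, dtype=torch.float64))

# P is orthogonal, but det(P) = (-1)^5 = -1 while det(I) = +1,
# so P can never equal the identity matrix
print(torch.linalg.det(P))
```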
Below are results from experiments in which I tried to fit a random orthogonal matrix ($n = 64$) with a chain of Householder transformations, using gradient descent on the mean squared error. I report the mean absolute error on the vertical axes for easier interpretation. Pay attention to the wobbling behavior on the right side of the plots. Interestingly, a random orthogonal matrix with $n = 64$ is equally likely to have $n - m = 63$ or $n - m = 64$ (see the notebook for the plot), so this is in no way a contrived experiment.
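A miniature version of this fitting experiment (with $n = 8$ instead of $64$; the optimizer, learning rate, and step count are my guesses, not the notebook's settings) might look as follows. The target's determinant is forced to $+1$ so that a chain of $n$ reflections, whose determinant is $(-1)^n = +1$ for even $n$, has matching parity:

```python
import torch

def householder(v):
    # standard Householder reflection I - 2 v v^T / (v^T v)
    I = torch.eye(v.shape[0], dtype=v.dtype)
    return I - 2.0 * torch.outer(v, v) / (v @ v)

def chain(vs):
    # compose one reflection per row of vs into a single orthogonal matrix
    P = torch.eye(vs.shape[1], dtype=vs.dtype)
    for v in vs:
        P = P @ householder(v)
    return P

torch.manual_seed(0)
n = 8
# random orthogonal target via QR of a Gaussian matrix
U, _ = torch.linalg.qr(torch.randn(n, n, dtype=torch.float64))
if torch.linalg.det(U) < 0:
    U[:, 0] = -U[:, 0]  # force det(U) = +1 to match the parity of n reflections

vs = torch.randn(n, n, dtype=torch.float64, requires_grad=True)
opt = torch.optim.Adam([vs], lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = ((chain(vs) - U) ** 2).mean()
    loss.backward()
    opt.step()
print(loss.item())
```

If the parity were mismatched (e.g. an odd-length chain fitting this target), the loss could not reach zero, which is consistent with the wobbling behavior described above.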