Switching spectral initialization to sklean.manifold.SpectralEmbeddings
dkobak opened this issue · 14 comments
A student in our lab is currently looking into spectral initialization, and she found out that openTSNE.init.spectral(tol=0)
in some cases does not agree to sklearn.manifold.SpectralEmbedding(affinity='precomputed')
. In some cases it does agree perfectly or near-perfectly, but we have an example when the result is very different, and SpectralEmbedding
gives what seems like a more meaningful result.
I looked at the math, and it seems that they should conceptually be computing the same thing (SpectralEmbedding
finds eigenvectors of the L_sym, whereas init.spectral
finds generalized eigenvectors or W and D, but that should be equivalent, as per https://jlmelville.github.io/smallvis/spectral.html @jlmelville). We don't know what the difference is due to. It may be numerical.
However, conceptually, it seems sensible if openTSNE would simply outsource the computation to sklearn.
A related issue is that init.spectral
is not reproducible and gives different results with each run. Apparently the way we initialize v0
makes ARPACK to still have some randomness. Sklearn gets around this by initializing v0
differently. I guess openTSNE should do the same -- but of course if we end up simply calling SpectralEmbedding
then it won't matter.
@dkobak don't want to hijack this issue so feel free to contact me directly, but I am curious:
- is the example a simulation dataset or have any other unusually regular structure?
- do you know what codepath and solver the spectral embedding ends up using?
- have you been able to look at the eigenvalues that come out? No chance that the solver has accidentally skipped an eigenvalue which just happens to give a nicer looking embedding? The discussion at yixuan/spectra#126 suggests that for theoretical reasons with shift-and-invert (which spectral embedding uses with the codepath that ends up calling
eigsh
), if the shift value matches the eigenvalue then it will be skipped. I can't say that arpack's version suffers from that but might be worth checking. - any noticeable differences in run time?
- assuming you are using the uniform affinity input, does adding a small number of random values (while maintaining symmetry) with small values (e.g. 1e-3) to the affinity matrix before creating the Laplacian have any affect on the result? In my experience, this can stabilize the convergence of a numerically problematic Laplacian without hugely affecting the initial embedding, although this seems to be an issue with UMAP graphs more than t-SNE, presumably due to the former being a lot sparser (also the UMAP tolerance is much looser). Might be worth a try though.
Hi James, thanks for the interest. This is a tiny graph with very pronounced geometric structure: https://sparse.tamu.edu/HB/can_96. The structure is a ring: https://sparse-files-images.engr.tamu.edu/HB/can_96_graph.gif
Here is the entire code to reproduce the issue (download and unpack the Matrix Market data from the link above):
%matplotlib notebook
import numpy as np
import pylab as plt
from scipy.io import mmread
adjacency = mmread('/home/dmitry/Downloads/can_96.mtx').toarray()
affinity = adjacency / np.sum(adjacency, axis=1, keepdims=True)
affinity = (affinity + affinity.T) / 2
affinity /= np.sum(affinity)
from openTSNE.initialization import spectral
Z1 = spectral(affinity, tol=0)
from sklearn.manifold import SpectralEmbedding
Z2 = SpectralEmbedding(affinity='precomputed').fit_transform(affinity)
plt.figure(figsize=(8, 4), layout='constrained')
plt.subplot(121)
plt.title('openTSNE')
plt.scatter(Z1[:,0], Z1[:,1])
plt.subplot(122)
plt.title('sklearn')
plt.scatter(Z2[:,0], Z2[:,1])
plt.savefig('spectral.png', dpi=200)
SpectralEmbedding is run with default parameters so I assume it's using arpack
solver via eigsh
. I have not checked the eigenvalues or anything else yet.
Wow, right after posting the above comment, I found out that the openTSNE's result only arises due to particular v0
:
v0 = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
eigvals, eigvecs = sp.linalg.eigsh(
A, M=D, k=k, tol=tol, maxiter=max_iter, which="LM", v0=v0
)
If I simply remove this v0
, then the result looks identical to the sklearn! So it seems that this v0
choice not only leads to non-reproducible results, but also screws up the eigenvector calculation!!
Sklearn uses this: https://github.com/scikit-learn/scikit-learn/blob/dc580a8ef5ee2a8aea80498388690e2213118efd/sklearn/utils/_arpack.py#L4
v0 = random_state.uniform(-1, 1, size)
so openTSNE should probably do the same.
Or we simply call SpectralEmbedding(affinity='precomputed', eigen_tol=tol).fit_transform(A)
.
More details on this:
D = np.diag(np.ravel(np.sum(affinity, axis=1)))
from scipy.sparse.linalg import eigsh
v0 = np.ones(affinity.shape[0]) / np.sqrt(affinity.shape[0])
eigvals, _ = eigsh(affinity, M=D, k=3, tol=0, which="LM", v0=v0)
print(eigvals)
eigvals, _ = eigsh(affinity, M=D, k=3, tol=0, which="LM")
print(eigvals)
results in
[0.90591439 0.94925302 1. ]
[0.94925302 0.94925302 1. ]
Last comment, promise: using sigma=1.0
in the eigsh
call (following sklearn) speeds up the calculation by 3x (for this particular graph):
%time eigvals, _ = eigsh(affinity, M=D, k=3, tol=0, which="LM")
%time eigvals, _ = eigsh(affinity, M=D, k=3, tol=0, which="LM", sigma=1.0)
So if we keep our own implementation, then using sigma=1.0
may be a good idea.
Thanks for all those details Dmitry. I had a feeling it would be something weird with a very structured graph. It is indeed skipping an eigenvector for some choices of v0
, presumably because the eigenvalues are degenerate?
In this case, the result seems to be exceptionally sensitive to v0
in a way I don't understand. The init.spectral
v0
should be a good guess because the vector of 1s is a (generalized) eigenvector (the one that gets discarded). Using np.ones
directly rather than scaling that vector to length 1 gives the correct result -- I'd be curious to know if that also works for you Dmitry. I am seeing similarly odd behavior when using eigsh
with the symmetrized graph Laplacian directly also.
I have never had good luck with setting sigma
. In my experience with it, in combination with sparse data, it seems to cause the routine to hang forever. Have you tried it with something like MNIST?
Using np.ones directly rather than scaling that vector to length 1 gives the correct result -- I'd be curious to know if that also works for you Dmitry.
Yes it does! Weird. Sklearn does mention here https://github.com/scikit-learn/scikit-learn/blob/dc580a8ef5ee2a8aea80498388690e2213118efd/sklearn/utils/_arpack.py#L7 that the scale of initialization may be important... However, at this point I do not trust this approach and would rather have v0
initialized randomly.
I have never had good luck with setting sigma. In my experience with it, in combination with sparse data, it seems to cause the routine to hang forever. Have you tried it with something like MNIST?
Just tried now, by generating the standard t-SNE affinity matrix for MNIST (so 90 neighbors for each point).
eigsh(affinity, M=D, k=3, tol=0, which="LM")
eigsh(affinity, M=D, k=3, tol=0, which="LM", sigma=1.0)
SpectralEmbedding(affinity='precomputed').fit_transform(affinity)
The first call took 11 s. The other two did indeed hang forever and I stopped both of them after a few minutes.
However, using Pyamg solver
SpectralEmbedding(affinity='precomputed', eigen_solver='amg').fit_transform(affinity)
also took 11 s. This may be due to looser tolerance though, as the docs mention that amg
solver sets its own tolerance which is above 0: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html.
Just noticed that setting the tolerance for the ARPACK solver in SpectralEmbedding is only possible in version 1.2.
Okay, so if we don't want to depend on sklearn 1.2, then it seems that maybe the best course of action is simply to kick out v0
and keep everything else as is, including NOT using sigma=1.0
. Thoughts?
On a technical basis, I suspect in most real-world cases, the current choice of v0
is good and it will probably be a small performance pessimization to use a random v0
. I am not sure if many people are going to hit this problem. If they do, they are probably researchers like yourself, so the current workaround (initialize manually and hence have control of the spectral settings) might be ok?
Do you have any evidence that the current choice of v0
is helpful and that a random v0
would lead to slower convergence? I don't see any difference on MNIST affinities.
With initialization.spectral
specifically, no. But with eigsh
on the shifted symmetric graph Laplacian, in general I have found that guessing the lowest eigenvalue usually helps. However, it's not consistent.
It seems to me that if sklearn doesn't have these pitfalls, and produces the same results as our implementation, it would make more sense to just switch over to their implementation. I don't really know any of the details of the numerical solvers and it could be a real hassle to maintain if we find similar bugs to this in the future. What do you think?
It seems to me that if sklearn doesn't have these pitfalls, and produces the same results as our implementation, it would make more sense to just switch over to their implementation. I don't really know any of the details of the numerical solvers and it could be a real hassle to maintain if we find similar bugs to this in the future. What do you think?
That was my original thinking here, however as @jlmelville pointed out, sklearn implementation can be VERY SLOW unless one either sets some non-zero tol
(which is only available from sklearn 1.2, and I have not tested the runtime of this yet), or uses amg
solver (which is only available if pyamg
is installed, and apparently does not allow tol=0
at all, as it sets its own tolerance).
So the question is if you want to depend on either sklearn 1.2 or pyamg.
I also like the simplicity of simply running eigsh(affinity, M=D, k=3, tol=tol, which="LM")
which seems to be much faster than sklearn, allows to use tol=0 or tol>0, and is only one line of code... It was a bit of a pain to figure out what exactly sklearn is computing, mathematically, as it partially goes through scipy etc.
Okay, that's fine by me. I'm not going to pretend I know what this does, but I'm sure you've tested it out thoroughly and I'll leave it to your judgement :) Maybe it would be best to open up a PR with this change so we can get this merged ASAP.
It is part of #225.
But just to clarify, the only edit is that it replaces
v0 = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
with
# v0 initializatoin is taken from sklearn.utils._arpack._init_arpack_v0()
random_state = check_random_state(random_state)
v0 = random_state.uniform(-1, 1, A.shape[0])
before running
eigvals, eigvecs = sp.linalg.eigsh(
A, M=D, k=k, tol=tol, maxiter=max_iter, which="LM", v0=v0
)
So the only edit is in v0
. The way we had it before led to nonreproducible results and other weird behaviour.