Own pairwise distances
Opened this issue · 3 comments
Would it be possible to add function which computes Barnes-Hut SNE embedings from a precomputed pairwise distance matrix?
When you say "pairwise distance matrix", do you mean a full NxN matrix? That does not make much sense to me: the whole point of Barnes-Hut SNE is that it is O(N log N) in computation and O(N) in memory, whereas a full pairwise distance matrix uses O(N^2) memory.
If you mean a sparse adjacency matrix that is O(|E|) with E the set of edges in the graph: yes, that would be possible and relatively straightforward to achieve.
Thank you for your reply.
Yes, I have a sparse non-symetric NxN matrix (N~=10E4) where each element represents similarity (scaled to <0, 1>) between all the pairs. So plugging this matrix would therefore skip all SNE magic and only performed embedding as force-directed layout algorithms do?
I tried to use this NxN matrix directly into the BH SNE, but performing PCA on this matrix is taking and an eternity and awful amount of memory. I was able to do so only with N~=5000 within reasonable time. Disabling the PCA in tsne.lua won't help – there was an error on line 107: local data_char = ffi.new('char[?]', nchars)
as it could not determine the size of the array or it was too big.
Any hints how to modify source code (or possibly data) to get something out of it?
https://github.com/clementfarabet/manifold/pull/4/files#diff-febf6147a17c3712ede775e407101a27R114
This is where the similarity matrix P is constructed based on the input N
xD
matrix X
. The similarity matrix is stored in row-compressed sparse format: row_P
is a N+1 int
array containing the row indices, col_P
is a int
NxK array containing the column indices, and val_P
a double
NxK array containing the similarity values. The call to computeGaussianPerplexitySparse()
here should thus be replaced by a call to a function that copies the data from your sparse similarity matrix into this row-compressed sparse format.
This would imply making some changes in how the function run()
is called by the FFI wrapper: you would need to adapt it to reflect the changes in your input format. Disable the PCA step altogether, but let run()
and work directly on your sparse similarity matrix.