clementfarabet/manifold

Own pairwise distances

Opened this issue · 3 comments

Would it be possible to add function which computes Barnes-Hut SNE embedings from a precomputed pairwise distance matrix?

When you say "pairwise distance matrix", do you mean a full NxN matrix? That does not make much sense to me: the whole point of Barnes-Hut SNE is that it is O(N log N) in computation and O(N) in memory, whereas a full pairwise distance matrix uses O(N^2) memory.

If you mean a sparse adjacency matrix that is O(|E|) with E the set of edges in the graph: yes, that would be possible and relatively straightforward to achieve.

Thank you for your reply.

Yes, I have a sparse non-symetric NxN matrix (N~=10E4) where each element represents similarity (scaled to <0, 1>) between all the pairs. So plugging this matrix would therefore skip all SNE magic and only performed embedding as force-directed layout algorithms do?

I tried to use this NxN matrix directly into the BH SNE, but performing PCA on this matrix is taking and an eternity and awful amount of memory. I was able to do so only with N~=5000 within reasonable time. Disabling the PCA in tsne.lua won't help – there was an error on line 107: local data_char = ffi.new('char[?]', nchars) as it could not determine the size of the array or it was too big.

Any hints how to modify source code (or possibly data) to get something out of it?

https://github.com/clementfarabet/manifold/pull/4/files#diff-febf6147a17c3712ede775e407101a27R114

This is where the similarity matrix P is constructed based on the input NxD matrix X. The similarity matrix is stored in row-compressed sparse format: row_P is a N+1 int array containing the row indices, col_P is a int NxK array containing the column indices, and val_P a double NxK array containing the similarity values. The call to computeGaussianPerplexitySparse() here should thus be replaced by a call to a function that copies the data from your sparse similarity matrix into this row-compressed sparse format.

This would imply making some changes in how the function run() is called by the FFI wrapper: you would need to adapt it to reflect the changes in your input format. Disable the PCA step altogether, but let run() and work directly on your sparse similarity matrix.