Implement tree-preserving embedding
Tree-preserving embedding (http://www.icml-2011.org/papers/421_icmlpaper.pdf) seems like a good match for the 2D visualization design goal, but scales as O(n^3) and is thus impractical for real datasets. It's an appealing concept though, since it is guaranteed to preserve cluster structure at all scales, without the heuristic hackiness of many other embedding methods. A Python/Numpy implementation would be useful for tinkering.
Also, a recent advance in the graphics world (http://www.graphics.cornell.edu/~bjw/IRT08Agglom3.pdf) uses a KD-tree to speed up construction of the cluster tree to nearly linear time. Could a similar method be used to make this embedding algorithm practical?
Updates:
I took a different (and far simpler) approach to approximating tree-preserving embedding. Just compute a hierarchical cluster tree, extract the pairwise cophenetic distances, and use them as inputs to MDS. (I'll call this "approximate TPE" for conciseness.) Although still O(n^3), it has much smaller leading constants, and it allows us to benefit from faster drop-in replacements for MDS.
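Here's a minimal sketch of what I mean, using SciPy for the cluster tree and cophenetic distances and scikit-learn's MDS (the function name and defaults are just illustrative, not a polished implementation):

```python
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform
from sklearn.manifold import MDS

def approximate_tpe(X, method="single", n_components=2, random_state=0):
    """Embed X via metric MDS on the cophenetic distances of a
    hierarchical cluster tree (a cheap stand-in for full TPE)."""
    # Build the cluster tree; `linkage` accepts the raw observation matrix
    Z = linkage(X, method=method)
    # Cophenetic distance between two points = height of the merge at which
    # they first fall into the same cluster
    coph = squareform(cophenet(Z))
    # Metric MDS on the precomputed cophenetic distance matrix
    mds = MDS(n_components=n_components, dissimilarity="precomputed",
              random_state=random_state)
    return mds.fit_transform(coph)
```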
Results
Sanity check
As a first sanity check I ran it on a simple synthetic dataset: a mixture of 2 Gaussians:
and got a reasonably cluster-preserving embedding:
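For reference, the sanity check amounts to roughly the following, reusing the `approximate_tpe` sketch above (the dimensionality, cluster sizes, and separation here are arbitrary choices on my part):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated isotropic Gaussians (sizes/separation chosen arbitrarily)
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
               rng.normal(loc=8.0, scale=1.0, size=(100, 2))])
labels = np.repeat([0, 1], 100)

Y = approximate_tpe(X, method="single")  # 2D embedding
# scatter-plot Y colored by `labels` to eyeball cluster preservation
```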
Digits
As a benchmark I ran it on ~2000 small handwritten digit images from USPS. They're pretty low-resolution:
Here are the embeddings produced by competing methods (copied from Figure 4 of the TPE paper):
If you run PCA, the embedding looks like this:
The approximate TPE result using the single-linkage cluster tree looks like this:
This produces significantly more class-homogeneous clusters.
This approach is also flexible in that I can use any method to construct the cluster tree, e.g. any of the inter-cluster distances listed here: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.linkage.html
So I went ahead and ran "approximate TPE" again using all of those linkage methods, with varying results (see the repo). I'm not sure how to compare them yet.
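Concretely, sweeping the linkage method with the sketch above is a one-liner (`X_digits` is a placeholder name for the ~2000 USPS digit images as a samples-by-pixels array):

```python
# All linkage methods supported by scipy.cluster.hierarchy.linkage
methods = ["single", "complete", "average", "weighted",
           "centroid", "median", "ward"]
# X_digits: placeholder for the USPS digit images, shape (n_samples, n_pixels)
embeddings = {m: approximate_tpe(X_digits, method=m) for m in methods}
```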
So, here's a way we can measure the quality of the embeddings:
Since we know the ground-truth class-memberships of each datapoint, we can measure how well the embedding puts datapoints from the same class near each other. We can do this by assigning each point the same label as its closest neighbor in the embedding. The family of "approximate TPE" algorithms I constructed perform very favorably by this metric compared with PCA and Isomap.
The dotted line is a reasonable upper bound on performance: the 1NN classification accuracy on the raw dataset. "Approximate TPE" with median linkage very closely approaches this bound (98.2% vs. 98.8% accuracy).
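Here's roughly how that evaluation can be computed, using the `embeddings` dict from above plus a placeholder `y_digits` array of ground-truth labels (the exact bookkeeping I used may differ):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_label_accuracy(Y, y):
    """Fraction of points whose nearest neighbor in Y (excluding the point
    itself) shares their ground-truth label."""
    _, idx = NearestNeighbors(n_neighbors=2).fit(Y).kneighbors(Y)
    return np.mean(y[idx[:, 1]] == y)  # idx[:, 0] is the point itself

for m, Y in embeddings.items():
    print(m, nn_label_accuracy(Y, y_digits))

# The dotted-line baseline: the same score on the raw data
print("raw", nn_label_accuracy(X_digits, y_digits))
```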
Anyway, this approach seems a lot more promising to me now. Thoughts, comments, suggestions, critiques? Does anyone know of prior art for this?
Hmm, I came up with another way to measure and visualize neighborhood-preservation performance:
- For each point, list its k-nearest neighbors in the original data-space and in the low-dimensional embedding. Count how many are in common. Repeat this for a range of values of k.
This provides another nice perspective on the kinds of structure these methods can distort or exaggerate. Although the approximate TPE methods preserve local structure very well, they seem to do a poor job of preserving larger-scale neighborhood structure. All of the nonlinear methods plotted exhibit a trade-off: they capture local structure up to a point, and then recall suffers for larger neighborhoods. PCA (and, I would suspect, any linear method) does not share this problem: recall increases monotonically with neighborhood size.
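A rough sketch of that neighborhood-recall computation (names are illustrative; `X_digits` is the raw data, `Y` one of the embeddings):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_recall(X, Y, k):
    """Mean fraction of each point's k nearest neighbors in the raw data X
    that are also among its k nearest neighbors in the embedding Y."""
    # Ask for k+1 neighbors because the query point itself comes back first
    idx_X = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1][:, 1:]
    idx_Y = NearestNeighbors(n_neighbors=k + 1).fit(Y).kneighbors(Y)[1][:, 1:]
    return np.mean([len(set(a) & set(b)) / k for a, b in zip(idx_X, idx_Y)])

ks = [5, 10, 20, 50, 100, 200]
curve = [neighborhood_recall(X_digits, Y, k) for k in ks]  # one curve per embedding
```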
Another kind of interesting thing: if you take a dataset, compute its Euclidean minimum spanning tree (MST), compute all pairwise shortest paths in the tree, and then perform MDS on these distances, you get another cool-looking, class-homogeneous embedding.
1NN-classification accuracy on digits data: 95.8%
and it kind of looks like what SPADE wishes it was doing?
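For reference, here's a sketch of that MST-geodesic construction using SciPy's sparse-graph routines (an illustration of the idea, not necessarily the exact code I ran):

```python
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from sklearn.manifold import MDS

def mst_geodesic_embedding(X, n_components=2, random_state=0):
    """MDS on all-pairs path lengths through the Euclidean MST of X."""
    D = squareform(pdist(X))                # dense Euclidean distance matrix
    mst = minimum_spanning_tree(D)          # sparse MST over the complete graph
    G = shortest_path(mst, directed=False)  # path lengths along the tree
    mds = MDS(n_components=n_components, dissimilarity="precomputed",
              random_state=random_state)
    return mds.fit_transform(G)
```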
Another simple test case:
Data: well-separated clusters (isotropic Gaussians) centered on the corners of a 3D cube
Algorithms:
- Approximate TPE (single-linkage): preserves cluster structure
- PCA solution obscures cluster structure:
- Manifold-learning solutions are nonsensical, since the data are not sampled from a single smooth manifold:
- Isomap:
- Locally linear embedding:
- t-SNE preserves cluster structure well:
- As does KPCA with RBF kernel and appropriately chosen bandwidth:
As we increase the dimensionality of the cube (which exponentially increases the number of clusters), the limitations of PCA become more severe, while t-SNE does a better job of keeping the clusters distinct:
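The test data here is easy to regenerate at any dimensionality; something like the following, where the cluster size and noise scale are arbitrary choices:

```python
import itertools
import numpy as np

def hypercube_clusters(d, points_per_cluster=50, noise=0.05, seed=0):
    """Isotropic Gaussian blobs centered on the 2^d corners of a unit hypercube."""
    rng = np.random.default_rng(seed)
    corners = np.array(list(itertools.product([0.0, 1.0], repeat=d)))
    X = np.vstack([c + noise * rng.standard_normal((points_per_cluster, d))
                   for c in corners])
    labels = np.repeat(np.arange(len(corners)), points_per_cluster)
    return X, labels

X6, y6 = hypercube_clusters(d=6)  # 2^6 = 64 clusters in 6D
```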
Diversion: what are the pairwise distances between all corners of a hypercube?
The distance matrix for the 4 vertices of a 2D square looks like this:
The distance matrix for the 8 vertices of a 3D cube looks like this:
Animation as we further increase the dimensionality:
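In general, two corners of a unit hypercube that differ in h coordinates are at Euclidean distance sqrt(h), so these distance matrices can be reproduced with a few lines (assuming unit side length):

```python
import itertools
import numpy as np
from scipy.spatial.distance import pdist, squareform

def corner_distance_matrix(d):
    """Pairwise Euclidean distances between the 2^d corners of a unit hypercube.
    Two corners differing in h coordinates are at distance sqrt(h)."""
    corners = np.array(list(itertools.product([0.0, 1.0], repeat=d)))
    return squareform(pdist(corners))

print(corner_distance_matrix(2))  # 4 x 4 matrix for the square
print(corner_distance_matrix(3))  # 8 x 8 matrix for the cube
```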
This is sort of related to testing, since I wanted a simple "hierarchical benchmark" where some clusters are near each other, some clusters of clusters are near each other, and so on. It turns out that hypercubes exhibit exactly such a hierarchy: a pair of vertices can share an edge, a face, a higher-dimensional face, or sit on opposite ends of the object.
Another view: measure inter-cluster distances in the data vs. the embeddings.
For the 6D example (with 2^6=64 cluster centers):
Direct comparisons:
R^2: -5.41
However, this test case is probably overly challenging: even if we directly optimize the 2D coordinates of 64 points to preserve the 6D inter-cluster distances, we can't do much better:
R^2: -1.98
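That "best achievable" baseline is essentially metric MDS run directly on the 64 cluster centers. Here's a sketch of how the comparison could be computed; the R^2 formula here is my guess at what I reported (1 minus residual over total sum of squares on the pairwise distances), so don't expect the exact numbers above:

```python
import itertools
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def r_squared(d_true, d_emb):
    """R^2 of embedded inter-cluster distances against the original ones
    (negative when the embedding distorts distances worse than a constant)."""
    ss_res = np.sum((d_true - d_emb) ** 2)
    ss_tot = np.sum((d_true - d_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# 64 cluster centers = corners of the 6D unit hypercube
centers = np.array(list(itertools.product([0.0, 1.0], repeat=6)))
d_true = pdist(centers)

# "Best achievable" baseline: directly optimize 2D coordinates of the 64
# centers to match the 6D inter-center distances (metric MDS / stress)
best_2d = MDS(n_components=2, dissimilarity="precomputed",
              random_state=0).fit_transform(squareform(d_true))
print(r_squared(d_true, pdist(best_2d)))
```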