maxentile/advanced-ml-project

Implement tree-preserving embedding

Opened this issue · 12 comments

Tree-preserving embedding (http://www.icml-2011.org/papers/421_icmlpaper.pdf) seems like a good match for the 2D visualization design goal, but scales as O(n^3) and is thus impractical for real datasets. It's an appealing concept though, since it is guaranteed to preserve cluster structure at all scales, without the heuristic hackiness of many other embedding methods. A Python/Numpy implementation would be useful for tinkering.

Also, a recent advance in the graphics world (http://www.graphics.cornell.edu/~bjw/IRT08Agglom3.pdf) uses a KD-tree to speed up construction of the cluster tree to nearly linear time. Could a similar method be used to make this embedding algorithm practical?

Updates:

I took a different (and far simpler) approach to approximating tree-preserving embedding: just compute a hierarchical cluster tree, extract the pairwise cophenetic distances, and use them as the input dissimilarities for MDS. (I'll call this "approximate TPE" for conciseness.) Although this is still O(n^3), it has much smaller constant factors, and it lets us benefit from faster drop-in replacements for MDS.
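Roughly, in Python/Numpy the pipeline would look something like this (a sketch, not the exact code in the repo; function and parameter names are just illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def approximate_tpe(X, method='single', n_components=2, random_state=0):
    """Embed X via metric MDS on cophenetic distances from a hierarchical cluster tree."""
    # 1. Build the hierarchical cluster tree on the raw pairwise distances.
    condensed = pdist(X)
    Z = linkage(condensed, method=method)

    # 2. Extract pairwise cophenetic (tree-based) distances between points.
    _, cophenetic = cophenet(Z, condensed)

    # 3. Metric MDS on the precomputed cophenetic distance matrix.
    mds = MDS(n_components=n_components, dissimilarity='precomputed',
              random_state=random_state)
    return mds.fit_transform(squareform(cophenetic))
```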

Results
Sanity check
As a first sanity check I ran it on a simple synthetic dataset: a mixture of 2 Gaussians:
image

and got a reasonably cluster-preserving embedding:
image
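For reference, the sanity check could be reproduced with something along these lines, using the `approximate_tpe` sketch above (the mixture parameters here are made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2) + 5.0,    # Gaussian cluster 1, shifted away from the origin
               rng.randn(200, 2) - 5.0])   # Gaussian cluster 2
Y = approximate_tpe(X, method='single')    # embedding, shape (400, 2)
```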

Digits
As a benchmark I ran it on ~2000 small handwritten digit images from USPS. They're pretty low-resolution:
image

Here are the embeddings produced by competitors (copied from Figure 4 of the TPE paper):
image


If you run PCA, the embedding looks like this:

image

The approximate TPE result using the single-linkage cluster tree looks like this:
image

This produces significantly more class-homogeneous clusters.

This approach is also flexible in that I can use any method to construct the cluster tree, e.g. any of the inter-cluster distances listed here: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.cluster.hierarchy.linkage.html

So I went ahead and ran "approximate TPE" again using all of those linkage methods, with varying results (see the repo). I'm not sure how to compare them yet.
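For reference, that sweep is just a loop over the `method` argument of the `approximate_tpe` sketch above (`X_digits` is a hypothetical stand-in for the digits data):

```python
# X_digits: the ~2000 USPS digit images flattened into a 2D array (hypothetical variable name).
linkage_methods = ['single', 'complete', 'average', 'weighted',
                   'centroid', 'median', 'ward']
# Note: centroid/median/ward assume the input distances are Euclidean.
embeddings = {m: approximate_tpe(X_digits, method=m) for m in linkage_methods}
```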

So, here's a way we can measure the quality of the embeddings:

Since we know the ground-truth class memberships of each datapoint, we can measure how well the embedding puts datapoints from the same class near each other. We can do this by assigning each point the same label as its closest neighbor in the embedding. The family of "approximate TPE" algorithms I constructed performs very favorably by this metric compared with PCA and Isomap.
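Concretely, the evaluation could look something like this (a sketch; `y_true` are the known class labels, and "label of the closest neighbor" is treated as leave-one-out 1NN classification):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_label_accuracy(embedding, y_true):
    """Fraction of points whose nearest neighbor (excluding themselves) shares their label."""
    nn = NearestNeighbors(n_neighbors=2).fit(embedding)
    _, idx = nn.kneighbors(embedding)            # idx[:, 0] is each point itself
    return np.mean(y_true[idx[:, 1]] == y_true)
```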

image

The dotted line is a reasonable upper bound on performance: the 1NN classification accuracy on the raw dataset. "Approximate TPE" with median linkage very closely approaches this bound (98.2% accuracy vs. 98.8%).

Anyway, this approach seems a lot more promising to me now. Thoughts, comments, suggestions, critiques? Is anyone aware of prior work that does this?

Oops, the figure above had a black outline around some bars but not others; fixed:
image

Added Locally Linear Embedding baseline:
image

Hmm, I came up with another way to measure / visualize neighborhood-preservation performance:

  • For each point, list its k nearest neighbors in the original data space and in the low-dimensional embedding, and count how many are in common. Repeat this for a range of values of k.
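A sketch of how this could be computed with scikit-learn's nearest-neighbor machinery (the function name and details are just illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_recall(X_high, X_low, ks):
    """For each k in ks: average fraction of each point's k nearest neighbors in the
    original space that are also among its k nearest neighbors in the embedding."""
    k_max = max(ks)
    # Ask for k_max + 1 neighbors so each point itself can be dropped from its own list.
    idx_high = NearestNeighbors(n_neighbors=k_max + 1).fit(X_high).kneighbors(X_high)[1][:, 1:]
    idx_low = NearestNeighbors(n_neighbors=k_max + 1).fit(X_low).kneighbors(X_low)[1][:, 1:]
    recalls = []
    for k in ks:
        overlaps = [len(set(h[:k]) & set(l[:k])) / float(k)
                    for h, l in zip(idx_high, idx_low)]
        recalls.append(np.mean(overlaps))
    return recalls
```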

image

This provides another nice perspective on the kinds of structure these methods can distort or exaggerate. Although the approximate TPE methods preserve local structure very well, they seem to do a poor job of preserving larger-scale neighborhood structure. All nonlinear methods plotted exhibit a trade-off: they capture local structure up to a point, and then recall suffers for larger neighborhoods. PCA (and, I would suspect, all linear methods) does not share this problem: recall increases monotonically with neighborhood size.

Re-did this analysis with all TPE variants, and over a larger range of k (up to k=1000 neighbors): looks like there's some pretty complicated behavior!
image

Legend:
image

Zoomed in:
image

Zoomed in even further:
image

Another kind of interesting thing: if you take a dataset, compute its Euclidean minimum spanning tree (MST), compute all pairwise shortest paths in the tree, and then perform MDS on these distances, you get another cool-looking / class-homogeneous embedding.
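A sketch of that variant using SciPy's sparse-graph routines (not necessarily how the figure below was made; parameter choices are illustrative):

```python
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def mst_embedding(X, n_components=2, random_state=0):
    """Embed X via metric MDS on shortest-path distances through the Euclidean MST."""
    D = squareform(pdist(X))                     # dense Euclidean distance matrix
    mst = minimum_spanning_tree(D)               # sparse MST (zero distances are treated as missing edges)
    D_tree = shortest_path(mst, directed=False)  # pairwise path lengths within the tree
    mds = MDS(n_components=n_components, dissimilarity='precomputed',
              random_state=random_state)
    return mds.fit_transform(D_tree)
```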

1NN-classification accuracy on digits data: 95.8%

image

and it kind of looks like what SPADE wishes it was doing?

Another simple test case:
Data: well-separated clusters (isotropic Gaussians) centered on the corners of a 3D cube
image
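The test data can be generated with something like the following (cluster sizes, separation, and noise scale here are made up for illustration):

```python
import itertools
import numpy as np

def hypercube_clusters(dim=3, points_per_cluster=50, separation=10.0, noise=1.0, seed=0):
    """Isotropic Gaussian clusters centered on the 2^dim corners of a hypercube."""
    rng = np.random.RandomState(seed)
    corners = np.array(list(itertools.product([0.0, 1.0], repeat=dim))) * separation
    X = np.vstack([c + noise * rng.randn(points_per_cluster, dim) for c in corners])
    labels = np.repeat(np.arange(len(corners)), points_per_cluster)
    return X, labels
```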
Algorithms:

  • Approximate TPE (single-linkage): preserves cluster structure

image

  • PCA solution obscures cluster structure:

image

  • Manifold-learning solutions are nonsensical, since the data are not sampled from a single smooth manifold:
    • Isomap:

image

  • Locally linear embedding:

image

  • t-SNE preserves cluster structure well:

image

  • As does KPCA with an RBF kernel and an appropriately chosen bandwidth (see the sketch after this list):

image
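For reference, the t-SNE and kernel-PCA baselines come straight from scikit-learn (the perplexity and gamma values below are placeholders, not the settings used for the figures; gamma plays the role of the RBF bandwidth, with gamma = 1/(2*sigma^2)):

```python
from sklearn.manifold import TSNE
from sklearn.decomposition import KernelPCA

# Using the hypothetical hypercube_clusters helper sketched above.
X_cube, cube_labels = hypercube_clusters(dim=3)

Y_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_cube)
Y_kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.1).fit_transform(X_cube)
```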

As we increase the dimensionality of the cube (which also exponentially increases the number of clusters), the limitations of PCA become more severe, while t-SNE does a better job of keeping the clusters distinct:
image

Diversion: what are the pairwise distances between all corners of a hypercube?

The distance matrix for the 4 vertices of a 2D square looks like this:
image

The distance matrix for the 8 vertices of a 3D cube looks like this:
image
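In general, the Euclidean distance between two vertices of a unit hypercube is the square root of the Hamming distance between their 0/1 coordinate vectors, which is what produces the block structure in these matrices. A quick way to generate them:

```python
import itertools
import numpy as np
from scipy.spatial.distance import pdist, squareform

def hypercube_distance_matrix(dim):
    """Pairwise Euclidean distances between the 2^dim vertices of a unit hypercube.
    Each entry equals sqrt(Hamming distance) between the corresponding bit vectors."""
    vertices = np.array(list(itertools.product([0, 1], repeat=dim)))
    return squareform(pdist(vertices))

print(hypercube_distance_matrix(2))  # 4 x 4 matrix for the square
print(hypercube_distance_matrix(3))  # 8 x 8 matrix for the cube
```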

Animation as we further increase the dimensionality:
image (animation)

This is sort of related to testing, since I wanted a simple "hierarchical benchmark" where some clusters are near each other, some clusters of clusters are near each other, etc., and it turns out that hypercubes exhibit such a hierarchy: a pair of vertices can share an edge, a 2D face, a 3D cell, and so on, up to sitting at opposite corners of the hypercube.

So yeah, it makes a lot more sense for the above comparison to use complete linkage rather than single linkage:
image

Class homogeneity of complete-linkage TPE solution: near-perfect (99.6%-100% 1NN classification accuracy)

Another view: measure inter-cluster distances in the data vs. the embeddings.
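One way this comparison could be computed (a sketch; I'm assuming the R^2 values below mean the coefficient of determination of the embedding's inter-cluster distances against the data-space ones, which can go strongly negative when the two sets of distances are on different scales):

```python
import numpy as np
from scipy.spatial.distance import pdist

def intercluster_distances(X, labels):
    """Condensed vector of pairwise Euclidean distances between cluster centroids."""
    centroids = np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])
    return pdist(centroids)

def r_squared(d_data, d_embed):
    """Coefficient of determination, treating the embedding's inter-cluster distances
    as predictions of the data-space ones; can be negative for a poor fit."""
    ss_res = np.sum((d_data - d_embed) ** 2)
    ss_tot = np.sum((d_data - d_data.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```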

For the 6D example (with 2^6=64 cluster centers):
image
image
image
image

Direct comparisons:
image
R^2: -5.41

image
R^2: -14.07

image
R^2: -44.05

However, this test case is probably overly challenging: even if we directly optimize the 2D coordinates of 64 points to preserve the 6D inter-cluster distances, we can't do much better:
image

image

R^2: -1.98
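For completeness, one way to set up that direct-optimization baseline is to run metric MDS (stress minimization) directly on the 64 cluster centers' 6D distance matrix, reusing the hypothetical helpers sketched above (again just a sketch, not necessarily how the figure was produced):

```python
from scipy.spatial.distance import squareform
from sklearn.manifold import MDS

X6, labels6 = hypercube_clusters(dim=6)
d_centers = intercluster_distances(X6, labels6)   # condensed 6D inter-centroid distances (64 choose 2 entries)
best_2d = MDS(n_components=2, dissimilarity='precomputed',
              random_state=0).fit_transform(squareform(d_centers))
```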