maxentile/advanced-ml-project

How to compare Isomap vs. PCA?

maxentile opened this issue · 1 comments

So I took a 10k-point subsample of the full Nanog-GFP dataset and embedded it into 2D using PCA and Isomap:

image

image

I'm not sure how to tell which one did a better job. Here are a couple of first-pass attempts:

  • Global structure: One way would be to compare reconstruction error. Concretely: reconstruct the dataset from the low-dimensional embedding, compute two pairwise distance matrices (one for the reconstruction and one for the data), take the Frobenius norm of their difference, and divide by the number of samples. Since the PCA objective places much more emphasis on preserving the large pairwise distances, PCA achieves lower reconstruction error: 9.3 (PCA) vs. 616.1 (Isomap).
  • Local structure: Reconstruction error may be an unfair metric, since Isomap aims to preserve geodesic distances estimated from local neighborhoods rather than global Euclidean distances.
    • One way to measure the preservation of local structure would be to compute the nearest neighbor of every point in the original dataset and count how often that point's nearest neighbor in the embedding is the same point (to use the notation from earlier today: $\frac{\sum_i \mathbf{1}[\mathrm{1NN}(X_i) = \mathrm{1NN}(X'_i)]}{|X|}$, where $\mathbf{1}[\cdot]$ is 1 when the two neighbors agree and 0 otherwise). Isomap outperforms PCA by this metric: 0.003 (PCA) vs. 0.006 (Isomap).
    • Another way would be to count how many of each point's k nearest neighbors are shared between the data space and the embedding. Isomap also outperforms PCA by this metric (tested for k=5): 0.0124 (PCA) vs. 0.0190 (Isomap).
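
For reference, here is a minimal sketch of how the metrics above could be computed. It is not necessarily how the numbers in this issue were produced. The function names (`distance_matrix_error`, `neighborhood_preservation`) are hypothetical, distances are assumed to be Euclidean, and the k-NN overlap is assumed to be normalized as a fraction of k. Everything is brute force with O(n²) memory, so a 10k-point sample would probably want a tree-based neighbor search instead.

```python
import numpy as np


def pairwise_distances(X):
    """Dense Euclidean distance matrix (brute force, O(n^2) memory)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)  # clip tiny negatives caused by round-off
    return np.sqrt(d2)


def distance_matrix_error(X, X_rec):
    """Frobenius norm of the difference between the two pairwise distance
    matrices, divided by the number of samples."""
    diff = pairwise_distances(X) - pairwise_distances(X_rec)
    return np.linalg.norm(diff, ord="fro") / X.shape[0]


def knn_indices(X, k):
    """Indices of each point's k nearest neighbors, excluding the point itself."""
    D = pairwise_distances(X)
    np.fill_diagonal(D, np.inf)  # a point is never its own neighbor
    return np.argsort(D, axis=1)[:, :k]


def neighborhood_preservation(X, X_emb, k=1):
    """Mean fraction of each point's k nearest neighbors that are shared
    between the data space and the embedding (assumed normalization).
    With k=1 this is the 1NN agreement rate."""
    nn_data = knn_indices(X, k)
    nn_emb = knn_indices(X_emb, k)
    shared = [len(np.intersect1d(a, b)) for a, b in zip(nn_data, nn_emb)]
    return float(np.mean(shared)) / k


if __name__ == "__main__":
    # Synthetic stand-in data; the first two coordinates act as a fake "embedding".
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X_emb = X[:, :2]
    print("1NN agreement:", neighborhood_preservation(X, X_emb, k=1))
    print("5NN overlap:  ", neighborhood_preservation(X, X_emb, k=5))
```

Both scores lie in [0, 1], with 1 meaning every neighborhood is preserved exactly, so values for different embeddings of the same data can be compared directly.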

Neighborhood preservation seems like a good enough measure: #4 (comment)