YingfanWang/PaCMAP

Support externally supplied nearest neighbors

Closed this issue · 4 comments

Correct me if I'm wrong, but it looks like users are currently locked into ANNOY for nearest neighbors. This is pretty limiting. Can you support externally supplied nearest neighbors?

While we're on the subject of nearest neighbors, could you provide some context for the n_neighbors_extra strategy? It's not obvious why a constant of 50 should be added. I recall seeing this same thing in the TriMap code. Perhaps this is noted in the paper but you'll understand it's easier to ask about this directly. Besides an explanation here, some code comments would be good.

Further to the above, what about n_neighbors_extra+1?

An edge case related to these questions: it seems it was considered that this strategy may result in requesting more nearest neighbors than there are rows in the dataset based on the line n_neighbors_extra = min(n_neighbors + 50, n). But if we assume this case is possible, doesn't n_neighbors_extra+1 introduce the problem again?

Thank you so much for your interest in our work!

  1. We will support other nearest neighbor algorithm in the future version, probably starting with supporting PyNNDescent used by UMAP v0.5.

  2. The reason that we search for more neighbors than asked is that the distance we are actually using to define neighborhood is the normalized distance (the raw distance between i and j, divided by the average distance of i's 4-6 neighbors to i, and the average distance of j's 4-6 neighbors to j). Therefore the order between neighbors could change, given that ANNOY is searching neighbors using the raw Euclidean distances under default settings. It's possible that some of the points became neighbor even though it's not one of the n_neighbors closest point in the high dimensional space. I will add some comments to that part of code later.

  3. If I remembered correctly, the extra "+1" is due to the fact that the sample itself would count as the very first neighbor, but I will need to go through the implementation to check the rationale behind. I agree that the edge case is not correctly handled.

@hyhuang00 thanks for contributing the research!

I want to clarify that I don't think you should be supporting specific nearest neighbor algorithms. I think of two immediate reasons:

  1. You shouldn't commit to specific algorithms because this is an active area of research and whichever you choose will be inferior to some other choice soon enough.
  2. It forces nearest neighbors to be recalculated for different runs of the algorithm. What if I just want to experiment with different PaCMAP parameters specifically? You would have me recalculate nearest neighbors repeatedly? That can take a long time.

It is conventional for algorithms to support externally supplied nearest neighbor graphs. Algorithms you appear to work with closely such as TriMap and UMAP both support this.

I admit I do not fully understand your explanation for (2). I'll leave the issue with the obvious comment that requiring large numbers of nearest neighbors requires heavier computational/memory resources and concerns me for being able to process large datasets efficiently. It also complicates my request that you support externally supplied nearest neighbor graphs. Think hard about a potential way around this, and if it really makes a significant difference or if the issue is more of an edge case. Can you confirm: this issue isn't specific to ANNOY, right?

Further to the above, I want to note that performance and scalability is a huge need in the field (of dimensionality reduction in general but I'm also talking about cytomics). Every CPU cycle and GB of memory counts. If you can put something out that is notably better in terms of speed and resource usage (while also producing good results, of course) then your potential impact is substantially larger. You're probably already aware of this but I thought I would provide some inspiration just in case 😄

I do commend what appears to be intentional choice of 32 bit data types.

Thank you for your comment and suggestions! We have now supported externally supplied nearest neighbor graphs in version 0.4. You can use pip install --upgrade pacmap to upgrade to the newest version through PyPI. The process is currently a little bit complicated than using externally supplied nearest neighbor graphs in UMAP and TriMap, as we still need the original data for sampling Mid-near and Further pairs (though the NN calculation is unnecessary now) -- in later versions we will try to make it easier. An example can be found in the README file.

I would argue this issue isn't specific to ANNOY since it is not because ANNOY giving non-neighbors due to approximation. Rather, it is because different points in the dataset may have a different "scale" of neighborhood, which means that the neighbors defined by the raw distance could be problematic.

We haven't conduct enough experiments without the extra neighbors to ensure the quality doesn't downgrade under such cases, therefore we keep this setting at this moment. That being said, users do not need to provide these extra neighbors when using externally supplied nearest neighbor graphs.

Again, thank you so much for your comment! It's a great honor for me and the whole PaCMAP team to read these very detailed feedback and comments for our research!

Glad to see this implemented and will be checking it out in the near future 👍