YingfanWang/PaCMAP

Enhancement requests. Intermediate manifold intersection and custom distance metrics

rhysnewell opened this issue · 8 comments

Hi there!

I currently use UMAP a lot in my projects and am aware of some its limitations so it's great to see new algorithms popping up that aim to fix some of the shortcomings in the field. UMAP has been around for awhile now and as such the community has made some pretty cool modifications and enhancements to actual algorithm. One of the best features, I think, is the ability to perform intersections and unions on the intermediate fuzzy representations: https://umap-learn.readthedocs.io/en/latest/composing_models.html
This lets the user generate a unified UMAP representation using multiple different distance metrics with ease. I believe that PaCMAP also has an intermediate manifold stage that might make it possible to implement a similar feature, but perhaps I'm wrong on this. Either way, I just wanted to bring it to your attention and perhaps get your thoughts on it.

The other request is something I'm sure is likely to come in future, but that is the ability to easily supply custom distance metrics to PaCMAP much like how the python UMAP handles this: https://umap-learn.readthedocs.io/en/latest/parameters.html?highlight=metric#metric

I realise that these likely non-trivial requests but I hearing what you think about them would be appreciated.

Cheers,
Rhys

Hi Rhys! Thank you for your interest in PaCMAP and we appreciate your comments on enhancement.

Regarding the second request, we do have plans to extend PaCMAP towards more custom distance metrics. The current implementation rely on the L2 distance to select the mid-near pairs, so modifying the PaCMAP will require much time and effort. If there's a particular metric that you are interested in, and maybe we can prioritize that.

Regarding the first request, I think it's cool and very interesting. Are you aware of any use case that may require this feature?

That's great to hear!

Well, I use the intersection method all the time in my tools Lorikeet and Rosella and their submodule flight to help some clustering stuff. But I'm not sure of other use cases

L1 distance, angular distance and Hamming distances are now supported in the newly released v0.5.4. Is there any other distance metric you would like to see to be implemented in PaCMAP?

Hi @hyhuang00

Thanks for adding new metrics to PaCMAP! Could you include Minkowski with p in [0,1]. According to recent publications on curse of dimensionality, metrics like Minkowski with p=0.3 are more suitable compared to Euclidean distance.

Ivan

our group uses the canberra metric quite a bit, and have found it to be the main reason we don't migrate to PaCMAP, even though we enjoy it very much. I tried adding the definition of the metric to the module until I realized reliance on AnnoyIndex keeps me from easily doing so. So here's a next metric suggestion!

Thank you all @ivan-marroquin @antoine-gs for your suggestions. Distance metrics such as general Minkowski distances and Canberra metric are not currently supported by the Annoy Index, which is quite frustrating. One possible workaround I could think of is to precompute the distance matrix and follow the custom distance matrix pair extraction method mentioned in README. I will try to find some alternative packages that can work on these metrics.

Seems like similar packages such as openTSNE are using the nearest neighbor trees from scikit-learn for custom distance metrics. We can also adopt that approach, but I think it is slower than the ANNOY approach.

I am very early in the process of checking this out, but seems like UMAP is moving towards pynndescent which supports mostly same distances a sklearn. I have never used either (annoy and pynndescent) so I don't know how they compare or even if it's even a possible/sensible avenue. Will keep an eye out, thank you!