YingfanWang/PaCMAP

Nearest neighbors on knowledge graph embedding PaCMAP reduction did not work well

Closed this issue · 7 comments

Hi, I applied PaCMAP to pretrained knowledge graph embeddings made from the wikidata5m knowledge graph (https://graphvite.io/docs/latest/pretrained_model.html). These embeddings are quite large; looking up nearest neighbors just barely fits within my swap limits, but is very slow (so I know it works, and the quality of the nearest neighbors in the original embedding is good). I tried RcppHNSW to build an index, but that did not work, so I thought I'd try PaCMAP. The results were not usable. How can I learn more about using PaCMAP for this use case? How reliable are nearest neighbors with PaCMAP?

Could you elaborate on what the PaCMAP results look like and why they are not usable? PaCMAP is not a nearest-neighbor-finding algorithm -- it relies on the AnnoyIndex to build the nearest neighbor index used for constructing the pairs.

Got it -- I was not expecting an exact result, but looking up nearest neighbors in the dimension-reduced embeddings just did not make any sense.
As I continued to poke around the original embeddings, I found that some of their nearest neighbors don't make a lot of sense either, at least for the SimplE embeddings. I used cosine similarity in both cases. Thank you for making this great library!
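The cosine-similarity neighbor lookup described above can be sketched as a brute-force NumPy routine (the embedding matrix here is a tiny hypothetical example, not the actual wikidata5m or SimplE data):

```python
import numpy as np

def cosine_nearest(embeddings, query_idx, k=5):
    """Brute-force cosine-similarity nearest neighbors.

    embeddings: (n, d) matrix of row vectors; query_idx: row to query.
    Returns the indices of the top-k most similar rows and their similarities.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]       # cosine similarity to the query row
    order = np.argsort(-sims)           # sort by descending similarity
    return order[:k], sims[order[:k]]

# Toy demo: four 2D vectors standing in for a vocabulary of embeddings.
X = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0],
              [-1.0, 0.0]])
idx, sims = cosine_nearest(X, 0, k=2)
# The query row ranks first with similarity 1.0; its closest other row is index 1.
```

On real graph embeddings this is the slow-but-exact baseline; an approximate index (HNSW, Annoy) trades some recall for speed.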

I have used PaCMAP like this before on transformer-based embeddings and got usable results (in the sense of being able to cluster the data into topics and to explore it using the PaCMAP projection). But this makes me wonder:

It would be very handy to see PaCMAP applied to more datasets in tutorials, so users can get a practical feeling for how tweaking the parameters impacts the results and the runtime. I'm sold that it can work better than UMAP out of the box, but not always and not under all circumstances.
It is a very powerful tool, but I have yet to fully understand how to use it for exploring text embeddings, where UMAP is an established tool.

I just reapplied PaCMAP on numberbatch knowledge graph embeddings.
For the term "keyword" the closest terms in the original embedding space are:
[1] "keyword" "workword"
[3] "keywords" "key_word"
[5] "keyphrase" "reserved_word"
[7] "metaword" "inverted_index"
[9] "meta_tag" "operative_word"
[11] "slashtag" "metatag"
[13] "vigenère_cipher" "search_term"
[15] "keyword_stuffing" "keyness"

Whereas with PaCMAP at default settings (except that I used random initialization and did not apply PCA before Annoy):

[1] "keyword" "photoserigraphy" "imprintest"
[4] "captioner" "cyberlearning" "efficient_pathfinding_test"
[7] "handprinted" "television_news" "middot"
[10] "stylus" "miscategorisation" "semibold"
[13] "inbox_zero" "photoblank" "baskerville"
[16] "verbal_message" "draftswoman" "scratchboard"
[19] "notationally" "post_polio_syndrome" "william_caxton"

Which don't make a lot of sense.

I see your concern here. PaCMAP, while robust to a wide range of parameter settings, still depends on the distance metric in the original space. Since your original space comes from a graph embedding, you may want to use cosine similarity for PaCMAP as well (this can easily be done with reducer = pacmap.PaCMAP(distance='angular')). Sorry for not mentioning it in the README file; this is a fairly new feature that has only just been added.

Regarding the nearest neighbors: since PaCMAP generally reduces embeddings down to 2D, the lower-dimensional space will inevitably lose some of the local structure of the original space, so neighbors there can be less meaningful, especially if you only look at the top two or so.

Thanks a lot for clarifying!

@KnutJaegersberg Have you obtained the results you wanted to see from PaCMAP? Personally I haven't tried using angular distance yet.

Nearest neighbor search didn't work as originally hoped. Nonetheless, thank you! I now have a better feeling for what to use PaCMAP for.

For example, the nearest neighbors of "keyword" in the reduced space, with their distances:

keyword        0.000000e+00
aristide       1.525113e-12
laffite        2.602030e-12
phisher        1.267253e-11
gulf_arabic    3.162060e-11
schiavone      1.385767e-10
haussmannian   3.707257e-10
qatar          5.530758e-10
oubliette      8.295673e-10
yossarian      9.479953e-10