koaning/whatlies

Warning for axis_metric

koaning opened this issue · 5 comments

Let's set up a small EmbeddingSet.

from whatlies.language import SpacyLanguage
from whatlies.transformers import Pca

words = ["prince", "princess", "nurse", "doctor", "banker", "man", "woman",
         "cousin", "niece", "king", "queen", "dude", "guy", "gal", "fire",
         "dog", "cat", "mouse", "red", "blue", "green", "yellow", "water",
         "person", "family", "brother", "sister"]

lang = SpacyLanguage("en_core_web_md")
emb = lang[words]

Let's now make some charts.

emb.transform(Pca(2)).plot(kind='scatter', annot=True)
emb.transform(Pca(2)).plot(kind='scatter', annot=True, axis_metric='cosine')
emb.transform(Pca(2)).plot(kind='scatter', annot=True, axis_metric='euclidean')

At the moment, they all seem to return this chart.

[image: the resulting scatter plot]

It took me a while to realise that this was not working as expected because I'm not using a word such as "king" as the x_axis/y_axis. Maybe we should introduce a warning here. Might be more user friendly.

mkaze commented

> Maybe we should introduce a warning here. Might be more user friendly.

Well, I have mentioned in the docstrings that a custom metric is only effective when the axis is a string or an Embedding (hence it is ignored when an integer axis is provided, which is the default):

https://github.com/RasaHQ/whatlies/blob/6811b7a24b71f169c5b97a0f8a17572f07f96f94/whatlies/embeddingset.py#L949-L950

But we can also add a warning for it if you consider it to be useful.

koaning commented

It's certainly in the docs, and this was certainly a moment of "lack of attention" on my part, but I can imagine that small changes like this can help beginning users. I'll add a small warning.
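As a rough illustration of the kind of check discussed here, the sketch below warns when a metric is passed but both axes are integers. The function name `warn_if_axis_metric_ignored` and the warning text are hypothetical and are not taken from the whatlies code base; it also assumes that a warning is only wanted when both axes are integers.

```python
import warnings


def warn_if_axis_metric_ignored(x_axis, y_axis, axis_metric):
    """Hypothetical helper, not part of whatlies.

    Emits a UserWarning when a metric is given but both axes are integer
    axes, in which case the metric would have no effect.
    Returns True if a warning was emitted, False otherwise.
    """
    if axis_metric is None:
        # No metric requested, so there is nothing that could be ignored.
        return False
    if isinstance(x_axis, int) and isinstance(y_axis, int):
        warnings.warn(
            "axis_metric is ignored because x_axis and y_axis are both "
            "integer axes; pass a string or an Embedding as an axis for "
            "the metric to take effect.",
            UserWarning,
            stacklevel=2,
        )
        return True
    # At least one axis is a string or an Embedding, so the metric is used.
    return False


# Integer axes (the default) plus a metric: this call emits a warning.
warn_if_axis_metric_ignored(0, 1, "cosine")
# A string axis: the metric applies to it, so no warning is emitted.
warn_if_axis_metric_ignored("king", 1, "cosine")
```

Returning a boolean is only there to make the sketch easy to check; a real implementation would more likely sit inside the plotting method and return nothing.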

mkaze commented

Further, I would like to resolve a recurring... I don't know what to call it, because I am sure you are fully aware of this, but let's call it a "confusion": pca_0, pca_1 and the like were only unit indicator vectors that helped keep the plot API consistent and made its implementation and usage easier (that is, before integer axis support was introduced). They should not be confused with the principal component vectors, nor thought of as representations of principal components, which would not be correct at all. In other words, they encoded no useful information and did not need to exist, which is why I would call them "helper vectors" (in the same sense as "helper functions").
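To make the "helper vector" point concrete, here is a minimal sketch in plain Python. The coordinates are made up for illustration and merely stand in for embeddings after a two-component reduction; `helper_0`, `helper_1` and `project` are hypothetical names. Projecting onto a unit indicator vector returns exactly the coordinate at that index, so the vector itself carries no information beyond "pick dimension i".

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def project(points, axis_vector):
    """Project every point onto axis_vector using the dot product."""
    return {name: dot(vec, axis_vector) for name, vec in points.items()}


# Made-up 2-D coordinates, standing in for already-reduced embeddings.
points = {"king": [0.8, -0.3], "queen": [0.6, 0.4]}

# Unit indicator ("helper") vectors: a 1 in one position, 0 elsewhere.
helper_0 = [1.0, 0.0]
helper_1 = [0.0, 1.0]

on_helper_0 = project(points, helper_0)
on_helper_1 = project(points, helper_1)

# Projecting onto a helper vector is the same as indexing the coordinate.
for name, vec in points.items():
    assert on_helper_0[name] == vec[0]
    assert on_helper_1[name] == vec[1]
```

This is why an integer axis can replace such a vector: indexing dimension `i` gives the same numbers without an extra embedding.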

koaning commented

Good point. I'll try to refer to these as "helper vectors" from here on.

koaning commented

I'm closing issues because, ever since the project moved to my personal account, it's been more in maintenance mode than in "active work" mode.