microsoft/kernel-memory

[Bug] Wrong wording on the Cosine Similarity doc page

Closed this issue · 5 comments

Context / Scenario

Read the document Cosine Similarity.

What happened?

The document Cosine Similarity contains the following text:

Cosine similarity is particularly useful when working with high-dimensional data such as word embeddings because it takes into account both the magnitude and direction of each vector. This makes it more robust than other measures like Euclidean distance, which only considers the magnitude.

Both sentences are far enough from the truth that they should be rewritten.

Importance

a fix would make my life easier

Platform, Language, Versions

KM Version 0.62.

Relevant log output

No response

dluc commented

Any suggestion about how to improve the text?

I'm not a good English writer. Sorry for wasting time on basic math.

  1. Cosine similarity does not take magnitude into account. The dot product divided by the product of the magnitudes is the cosine of the angle between the two vectors.
  2. Euclidean distance always depends on the angle between the vectors. The Euclidean distance between the vectors (2) and (1) is 1, while the Euclidean distance between the vectors (2) and (-1) is 3. Likewise, the Euclidean distance between two radius vectors of the same circle, which have equal magnitudes, is generally not zero.
  3. Cosine similarity is often preferred because it compensates for magnitudes.
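
The examples in point 2 can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the helper names `euclidean_distance` and `cosine_similarity` are illustrative and are not part of Kernel Memory.

```python
import math

def euclidean_distance(a, b):
    """Length of the difference vector a - b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Dot product divided by the product of the magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-dimensional vectors from point 2.
d_same = euclidean_distance((2,), (1,))       # same direction
d_opposite = euclidean_distance((2,), (-1,))  # opposite direction

# Cosine similarity of the same pairs depends only on direction.
cos_same = cosine_similarity((2,), (1,))
cos_opposite = cosine_similarity((2,), (-1,))

# Two radius vectors of the unit circle: equal magnitudes, different directions.
d_radius = euclidean_distance((1.0, 0.0), (0.0, 1.0))

print(d_same, d_opposite, cos_same, cos_opposite, d_radius)
```

The two distances differ even though the magnitudes of the inputs are the same in both pairs, which is only possible if Euclidean distance is sensitive to direction.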

dluc commented

The sentence about cosine similarity seems OK to me; there's only a typo in the Euclidean part at the end. This change should be sufficient:

Cosine similarity is particularly useful when working with high-dimensional data such as word embeddings because it takes into account both the magnitude and direction of each vector. This makes it more robust than other measures like Euclidean distance, which only considers the ~~magnitude~~ direction.

Given two n-dimensional points A and B

  • A is a point (A1, A2, ..., An)
  • B is a point (B1, B2, ..., Bn)

Cosine similarity:

[image: formula for cosine similarity, $\frac{A·B}{|A||B|}$]

Where:

  • A dot B is the dot product of the vectors A and B
    [image: formula for the dot product, $A·B = \sum_{i=1}^{n} A_i B_i$]

  • |A| is the magnitude of vector A

  • |B| is the magnitude of vector B

so cosine similarity does take into account the magnitude, as mentioned.

Euclidean distance:

[image: formula for Euclidean distance, $\sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$]

so Euclidean distance considers only the direction, not the magnitude -- this is the part to fix.

dluc commented

Docs updated

I'm sorry for wasting your time on basic math but it is important.

In a Euclidean space, $A·B = |A||B| \cos β$, so
$\frac{A·B}{|A||B|} = \frac{|A||B| \cos β}{|A||B|} = \cos β$
In fact, $\frac{A·B}{|A||B|} = \cos β$ is the definition of the angle β between two vectors.
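
As a rough illustration of this identity, the angle can be recovered from the ratio with `math.acos`. The sketch below assumes plain Python tuples for vectors; the function names `dot`, `magnitude`, and `angle_between` are made up for this example.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    """Euclidean norm |a|."""
    return math.sqrt(dot(a, a))

def angle_between(a, b):
    """Angle β in radians, from cos β = A·B / (|A||B|)."""
    cos_beta = dot(a, b) / (magnitude(a) * magnitude(b))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cos_beta)))

# Vectors 45 degrees apart.
beta = angle_between((1.0, 0.0), (1.0, 1.0))

# Rescaling either vector by a positive factor leaves the angle unchanged,
# because the factor cancels between A·B and |A||B|.
beta_scaled = angle_between((7.0, 0.0), (0.5, 0.5))

print(math.degrees(beta), math.degrees(beta_scaled))
```

Both calls give the same angle up to rounding, since the magnitudes cancel out of the ratio.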

Given five vectors $A=(0,3), B=(0,2), C=(0,1), D=(2,0), E=(\sqrt{2},\sqrt{2})$

[image: the five vectors A, B, C, D, and E]

Magnitudes are
$|A| = 3, |B| = 2, |C| = 1, |D| = 2, |E| = 2$

Euclidean distances ρ are
$ρ(A,B) = |\sqrt{(0-0)²+(3-2)²}| = 1$
$ρ(B,C) = |\sqrt{(0-0)²+(2-1)²}| = 1$
$ρ(A,C) = |\sqrt{(0-0)²+(3-1)²}| = 2$
$ρ(A,D) = |\sqrt{(0-2)²+(3-0)²}| = \sqrt{13}$
$ρ(B,D) = |\sqrt{(0-2)²+(2-0)²}| = \sqrt{8}$
$ρ(C,D) = |\sqrt{(0-2)²+(1-0)²}| = \sqrt{5}$
$ρ(A,E) = |\sqrt{(0-\sqrt{2})²+(3-\sqrt{2})²}| ≈ 2.12$
$ρ(B,E) = |\sqrt{(0-\sqrt{2})²+(2-\sqrt{2})²}| ≈ 1.53$
$ρ(C,E) = |\sqrt{(0-\sqrt{2})²+(1-\sqrt{2})²}| ≈ 1.47$
$ρ(D,E) = |\sqrt{(2-\sqrt{2})²+(0-\sqrt{2})²}| ≈ 1.53$

Cosine similarities σ are
$σ(A,B) = \frac{0×0+3×2}{3×2} = \frac{6}{6} = 1 = \cos 0$
$σ(B,C) = \frac{0×0+2×1}{2×1} = \frac{2}{2} = 1 = \cos 0$
$σ(A,C) = \frac{0×0+3×1}{3×1} = \frac{3}{3} = 1 = \cos 0$
$σ(A,D) = \frac{0×2+3×0}{3×2} = \frac{0}{6} = 0 = \cos\frac{π}{2}$
$σ(B,D) = \frac{0×2+2×0}{2×2} = \frac{0}{4} = 0 = \cos\frac{π}{2}$
$σ(C,D) = \frac{0×2+1×0}{1×2} = \frac{0}{2} = 0 = \cos\frac{π}{2}$
$σ(A,E) = \frac{0×\sqrt{2}+3×\sqrt{2}}{3×2} = \frac{3\sqrt{2}}{6} = \frac{1}{\sqrt{2}} = \cos\frac{π}{4}$
$σ(B,E) = \frac{0×\sqrt{2}+2×\sqrt{2}}{2×2} = \frac{2\sqrt{2}}{4} = \frac{1}{\sqrt{2}} = \cos\frac{π}{4}$
$σ(C,E) = \frac{0×\sqrt{2}+1×\sqrt{2}}{1×2} = \frac{\sqrt{2}}{2} = \frac{1}{\sqrt{2}} = \cos\frac{π}{4}$
$σ(D,E) = \frac{2×\sqrt{2}+0×\sqrt{2}}{2×2} = \frac{2\sqrt{2}}{4} = \frac{1}{\sqrt{2}} = \cos\frac{π}{4}$

A, B, and C have the same direction but different magnitudes. Their Euclidean distances differ, while their cosine similarities are all equal to 1. So Euclidean distance takes into account both direction and magnitude, while cosine similarity depends only on direction.
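
The numbers above can be reproduced with a short script. This is a sketch under the assumption that the five vectors are exactly as listed; it uses only the Python standard library, and the helper names are illustrative.

```python
import math
from itertools import combinations

# The five vectors from the worked example.
vectors = {
    "A": (0.0, 3.0),
    "B": (0.0, 2.0),
    "C": (0.0, 1.0),
    "D": (2.0, 0.0),
    "E": (math.sqrt(2), math.sqrt(2)),
}

def euclidean_distance(a, b):
    """Length of the difference vector a - b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Dot product divided by the product of the magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Compute both measures for every unordered pair of vectors.
results = {}
for (name_a, a), (name_b, b) in combinations(vectors.items(), 2):
    rho = euclidean_distance(a, b)
    sigma = cosine_similarity(a, b)
    results[(name_a, name_b)] = (rho, sigma)
    print(f"rho({name_a},{name_b}) = {rho:.2f}   sigma({name_a},{name_b}) = {sigma:.2f}")
```

For the pairs among A, B, and C the distance column varies while the similarity column stays constant, which is the point of the example.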

Cosine similarity on Wikipedia
Dot product on Wikipedia