explosion/spaCy

vector_norm and similarity value incorrect

xuanyiguang opened this issue · 4 comments

Somehow vector_norm is incorrectly calculated.

import spacy
import numpy as np
nlp = spacy.load("en")
# using u"apples" just as an example
apples = nlp.vocab[u"apples"]
print(apples.vector_norm)
# prints 1.4142135381698608, or sqrt(2)
print(np.sqrt(np.dot(apples.vector, apples.vector)))
# prints 1.0

vector_norm is then used in similarity, which therefore always returns half of the correct value.

def similarity(self, other):
    # cosine similarity, computed from the precomputed vector_norm values
    if self.vector_norm == 0 or other.vector_norm == 0:
        return 0.0
    return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)

This is fine if the use case is only to rank similarity scores for synonyms, but the cosine similarity score itself is incorrect.
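To make the halving concrete, here is a minimal NumPy-only sketch (no spaCy needed; the vectors are chosen just for illustration) of what happens when unit vectors are paired with a stored norm of sqrt(2). Computing the norms by hand also serves as a workaround for now:

import numpy as np

# two unit-length vectors with a known cosine similarity
a = np.array([1.0, 0.0], dtype=np.float32)
b = np.array([0.6, 0.8], dtype=np.float32)

true_cosine = np.dot(a, b)  # norms are 1, so cosine == dot == 0.6

# what similarity() computes when vector_norm is sqrt(2)
stored_norm = np.sqrt(2.0)
buggy_cosine = np.dot(a, b) / (stored_norm * stored_norm)

print(true_cosine)   # 0.6
print(buggy_cosine)  # 0.3, exactly half

# workaround: derive the norms from the vectors themselves
def cosine(u, v):
    norm_u, norm_v = np.linalg.norm(u), np.linalg.norm(v)
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return np.dot(u, v) / (norm_u * norm_v)

print(cosine(a, b))  # 0.6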

Thanks! Will figure this out.

I think this is fixed in 1.0, but this bug makes me uneasy because I don't feel like I really understand what was wrong. I haven't had time to test 0.101.0 yet, but: you say the cosine was always half? I can't figure out why that should be...

What I've come up with is that this calculation looks unreliable:

        for orth, lex_addr in self._by_orth.items():
            lex = <LexemeC*>lex_addr
            if lex.lower < vectors.size():
                lex.vector = vectors[lex.lower]
                # note: lex.l2_norm is never reset to 0 before this loop
                for i in range(vec_len):
                    lex.l2_norm += (lex.vector[i] * lex.vector[i])
                lex.l2_norm = math.sqrt(lex.l2_norm)
            else:
                lex.vector = EMPTY_VEC

The lex.l2_norm value is possibly uninitialised, so there may be a problem there. Passing a 32-bit float to the Python function math.sqrt is also suspicious. But if this were the problem, the results should have been "unreliable, always wrong". Always half?? Unsettling!
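To see what the missing reset would do, here's a plain-Python mock of the loop above (the function name and stale values are hypothetical, just for illustration):

import math

def norm_with_stale_accumulator(vector, stale_value):
    # mimics the Cython loop: the accumulator is not reset to 0,
    # so whatever lex.l2_norm already held leaks into the sum
    acc = stale_value
    for x in vector:
        acc += x * x
    return math.sqrt(acc)

unit = [0.6, 0.8]  # a unit-length vector

print(norm_with_stale_accumulator(unit, 0.0))   # 1.0, the correct norm
print(norm_with_stale_accumulator(unit, 0.37))  # 1.1705..., garbage

If the stale value were arbitrary memory, the norms should be arbitrary too, which is why a consistent factor of exactly half is so strange.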

Got it now.

The previous default vectors were already normalised. This led to a value of lex.l2_norm = 1 being stored in the lexemes.bin file, which was then read back into the LexemeC struct when the vocabulary was deserialised.

Later, I added the capability to load custom word vectors, which meant the L2 norm had to be calculated. However, I didn't initialise the value of lex.l2_norm to 0 before computing the new norm. Since the default vectors were normalised, the initial value was always 1, and the eventual norm was sqrt(1 + 1) = sqrt(2). This explains why the similarity was consistently half.
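Plugging the deserialised value into the same kind of mock makes the arithmetic explicit (a sketch only; a 2-d unit vector stands in for the normalised defaults):

import math

unit = [0.6, 0.8]   # default vectors were unit length
acc = 1.0           # the stale lex.l2_norm read back from lexemes.bin

for x in unit:
    acc += x * x    # adds the true squared norm, which is 1.0

print(math.sqrt(acc))  # 1.4142..., i.e. sqrt(2)

# with both norms inflated to sqrt(2), every cosine gets divided by
# sqrt(2) * sqrt(2) = 2 instead of 1 * 1 = 1: exactly half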

No tests checked the exact value returned by the similarity function. They only sanity-checked relative values. This has since been addressed.
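For the record, this is the kind of exact-value check that would have caught it (a sketch, not the actual spaCy test; it uses a plain NumPy cosine so it's self-contained):

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def test_similarity_exact_value():
    v = np.array([0.6, 0.8], dtype=np.float32)
    # a vector compared with itself must score exactly 1.0, not 0.5:
    # an exact-value assertion, not just a ranking sanity check
    assert np.isclose(cosine(v, v), 1.0)

test_similarity_exact_value()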

lock commented

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.