explosion/spaCy

vector_norm and similarity value incorrect

xuanyiguang opened this issue · 4 comments

Somehow vector_norm is incorrectly calculated.

import spacy
import numpy as np
nlp = spacy.load("en")
# using u"apples" just as an example
apples = nlp.vocab[u"apples"]
print(apples.vector_norm)
# prints 1.4142135381698608, or sqrt(2)
print(np.sqrt(np.dot(apples.vector, apples.vector)))
# prints 1.0

vector_norm is then used in similarity, which therefore always returns half of the correct value.

def similarity(self, other):
    # cosine similarity, computed from the precomputed vector_norm values
    if self.vector_norm == 0 or other.vector_norm == 0:
        return 0.0
    return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)

This is fine if the use case is only to rank similarity scores for synonyms, but the cosine similarity score itself is incorrect.
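To make the halving concrete, here is a minimal NumPy-only sketch (no spaCy needed; the vectors are chosen just for illustration) of what happens when unit vectors are paired with a stored norm of sqrt(2). Computing the norms by hand also serves as a workaround for now:

import numpy as np

# two unit-length vectors with a known cosine similarity
a = np.array([1.0, 0.0], dtype=np.float32)
b = np.array([0.6, 0.8], dtype=np.float32)

true_cosine = np.dot(a, b)  # norms are 1, so cosine == dot == 0.6

# what similarity() computes when vector_norm is sqrt(2)
stored_norm = np.sqrt(2.0)
buggy_cosine = np.dot(a, b) / (stored_norm * stored_norm)

print(true_cosine)   # 0.6
print(buggy_cosine)  # 0.3, exactly half

# workaround: derive the norms from the vectors themselves
def cosine(u, v):
    norm_u, norm_v = np.linalg.norm(u), np.linalg.norm(v)
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return np.dot(u, v) / (norm_u * norm_v)

print(cosine(a, b))  # 0.6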

Thanks! Will figure this out.

I think this is fixed in 1.0, but this bug makes me uneasy because I don't feel like I really understand what was wrong. I haven't had time to test 0.101.0 yet, but: you say the cosine was always half? I can't figure out why that should be...

What I've come up with is that this calculation looks unreliable:

        for orth, lex_addr in self._by_orth.items():
            lex = <LexemeC*>lex_addr
            if lex.lower < vectors.size():
                lex.vector = vectors[lex.lower]
                # note: lex.l2_norm is never reset to 0 before this loop
                for i in range(vec_len):
                    lex.l2_norm += (lex.vector[i] * lex.vector[i])
                lex.l2_norm = math.sqrt(lex.l2_norm)
            else:
                lex.vector = EMPTY_VEC

The lex.l2_norm value is possibly uninitialised, so there may be a problem there. Passing a 32-bit float to the Python function math.sqrt is also suspicious. But if this were the problem, the results should have been "unreliable, always wrong". Always half?? Unsettling!
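To see what the missing reset would do, here's a plain-Python mock of the loop above (the function name and stale values are hypothetical, just for illustration):

import math

def norm_with_stale_accumulator(vector, stale_value):
    # mimics the Cython loop: the accumulator is not reset to 0,
    # so whatever lex.l2_norm already held leaks into the sum
    acc = stale_value
    for x in vector:
        acc += x * x
    return math.sqrt(acc)

unit = [0.6, 0.8]  # a unit-length vector

print(norm_with_stale_accumulator(unit, 0.0))   # 1.0, the correct norm
print(norm_with_stale_accumulator(unit, 0.37))  # 1.1705..., garbage

If the stale value were arbitrary memory, the norms should be arbitrary too, which is why a consistent factor of exactly half is so strange.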

Got it now.

The previous default vectors were already normalised. This led to a value of lex.l2_norm = 1 being stored in the lexemes.bin file, which was then read back into the LexemeC struct when the vocabulary was deserialised.

Later, I added the capability to load custom word vectors, which meant the L2 norm had to be calculated. However, I didn't initialise the value of lex.l2_norm to 0 before computing the new norm. Since the default vectors were normalised, the initial value was always 1, and the eventual norm was sqrt(1 + 1) = sqrt(2). This explains why the similarity was consistently half.
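Plugging the deserialised value into the same kind of mock makes the arithmetic explicit (a sketch only; a 2-d unit vector stands in for the normalised defaults):

import math

unit = [0.6, 0.8]   # default vectors were unit length
acc = 1.0           # the stale lex.l2_norm read back from lexemes.bin

for x in unit:
    acc += x * x    # adds the true squared norm, which is 1.0

print(math.sqrt(acc))  # 1.4142..., i.e. sqrt(2)

# with both norms inflated to sqrt(2), every cosine gets divided by
# sqrt(2) * sqrt(2) = 2 instead of 1 * 1 = 1: exactly half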

No tests checked the exact value returned by the similarity function. They only sanity-checked relative values. This has since been addressed.
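For the record, this is the kind of exact-value check that would have caught it (a sketch, not the actual spaCy test; it uses a plain NumPy cosine so it's self-contained):

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def test_similarity_exact_value():
    v = np.array([0.6, 0.8], dtype=np.float32)
    # a vector compared with itself must score exactly 1.0, not 0.5:
    # an exact-value assertion, not just a ranking sanity check
    assert np.isclose(cosine(v, v), 1.0)

test_similarity_exact_value()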

lock commented

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.