maciejkula/glove-python

STS benchmark reproducibility?

Tiriar opened this issue · 2 comments

Hello,

I was wondering if anyone has managed to reproduce the sentence similarity scores on the STS benchmark dataset (http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark). I tried to do so using your transform_paragraph function, tokenizing the sentences with the StanfordTokenizer from the NLTK library, but I only reached a Pearson coefficient of a bit over 0.3 on the test set (the STS leaderboard reports around 0.4).
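For reference, the evaluation step is just the Pearson correlation between predicted sentence similarities (e.g. cosine similarity of sentence vectors) and the gold STS scores. A minimal sketch with NumPy, using made-up vectors and gold scores purely for illustration:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(pred, gold):
    """Pearson correlation between predicted and gold similarity scores."""
    return float(np.corrcoef(pred, gold)[0, 1])

# Toy sentence vectors and gold scores (hypothetical, not real STS data).
pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 1.0])),
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),
    (np.array([1.0, 1.0]), np.array([1.0, 1.0])),
]
gold = [3.5, 0.0, 5.0]

pred = [cosine(a, b) for a, b in pairs]
print(pearson(pred, gold))
```

On the real benchmark, `pred` would hold one cosine similarity per sentence pair in the test split and `gold` the annotated scores.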

I know the transform_paragraph function is only experimental, but I was wondering whether you implemented it entirely yourself or used an official GloVe sentence embedding (I do not know exactly how they weight the individual words to get the sentence vector).

Thanks :)

Now I have actually managed to get a Pearson coefficient of 0.43 by building the sentence vectors as a simple average of the word vectors of the tokens in each sentence.

My bad, I was using the wrong word tokenizer; now I am getting the correct score. Closing the issue ;).

For anyone who would also like to reproduce the STS Benchmark results: simply tokenize the sentences (I used the spaCy English tokenizer), lower-case the tokens, and average the word vectors of the tokens in each sentence.
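The recipe above can be sketched in a few lines. This is a minimal illustration assuming the word vectors are available as a plain dict mapping lower-cased tokens to NumPy arrays (as you would get from a trained glove-python model's vocabulary and vector matrix); the toy vectors here are made up:

```python
import numpy as np

def sentence_vector(tokens, word_vectors, dim):
    """Lower-case the tokens and average the vectors of in-vocabulary words."""
    vecs = [word_vectors[t.lower()] for t in tokens if t.lower() in word_vectors]
    if not vecs:
        # No known tokens: fall back to a zero vector.
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy embeddings standing in for trained GloVe vectors (hypothetical).
word_vectors = {
    "the": np.array([1.0, 0.0]),
    "cat": np.array([0.0, 1.0]),
}
print(sentence_vector(["The", "cat"], word_vectors, dim=2))  # [0.5 0.5]
```

In practice the tokens would come from the spaCy tokenizer, and the resulting sentence vectors would be compared with cosine similarity against the STS gold scores.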