
Compute text similarity using Word Mover's Distance algorithm (Enhanced)


Text similarity: WMD_enhanced



Python 3.6.5
pyemd, numpy, gensim, sklearn, nltk, jieba, pandas, codecs, re

Pretrained word2vec model used in this code

Supports both English and Chinese text.
Chinese word2vec (CBOW, UTF-8): 2.18 GB
English word2vec: 1.5 GB

Algorithm description

This project enhances the Word Mover's Distance (WMD) algorithm.

What is the Word Mover's Distance algorithm?

See algorithm details in this paper: "From Word Embeddings To Document Distances": http://proceedings.mlr.press/v37/kusnerb15.pdf
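
In the paper, WMD is the minimum cumulative cost of moving the normalized bag-of-words mass of one document onto the other, with the Euclidean distance between word vectors as the per-word cost. Solving that transport problem exactly needs an EMD solver (presumably the role of pyemd in the dependency list). As a self-contained illustration, the sketch below computes only the relaxed lower bound described in the same paper, where every word sends all of its mass to its nearest word in the other document. The function names and the toy 2-d embeddings are hypothetical and are not taken from this repository.

```python
import math
from collections import Counter


def nbow(tokens):
    """Normalized bag-of-words: map each word to its relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_wmd(tokens1, tokens2, vectors, dist=euclidean):
    """Relaxed WMD lower bound (NOT the exact WMD).

    Each word moves all of its mass to the closest word of the other
    document; the larger of the two one-sided costs is returned.
    """
    def one_sided(src, dst):
        targets = set(dst)
        return sum(
            weight * min(dist(vectors[word], vectors[t]) for t in targets)
            for word, weight in nbow(src).items()
        )

    return max(one_sided(tokens1, tokens2), one_sided(tokens2, tokens1))


# Hypothetical 2-d embeddings, chosen only to make the example readable.
toy_vectors = {
    'obama': (1.0, 0.0),
    'president': (1.0, 0.1),
    'speaks': (0.0, 1.0),
    'greets': (0.1, 1.0),
    'banana': (-1.0, -1.0),
}

close = relaxed_wmd(['obama', 'speaks'], ['president', 'greets'], toy_vectors)
far = relaxed_wmd(['obama', 'speaks'], ['banana'], toy_vectors)
print('related documents:   %.4f' % close)
print('unrelated documents: %.4f' % far)
```

The `dist` parameter is left pluggable so that a different word-level distance can be substituted for the Euclidean one.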

What are the shortcomings of the WMD algorithm?

1. The WMD algorithm is time-consuming because it computes the Euclidean distance between word vectors.
2. The WMD algorithm yields an unbounded distance between two documents, which makes it difficult to express similarity as a value in [0, 1].

What are the enhancements?

The WMD_enhanced algorithm computes the Normalized Cosine Distance between word vectors:

  1. Compute the cosine similarity between the word vectors as res1
  2. Normalize res1 to [0, 1] as res2
  3. Normalized Cosine Distance = 1 - res2
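
The three steps can be sketched as follows. The README does not say how res1 is normalized, so this sketch assumes a linear rescaling of the cosine similarity from [-1, 1] to [0, 1]; the function names are hypothetical and the actual code may normalize differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def normalized_cosine_distance(u, v):
    """Steps 1-3 above, under an ASSUMED normalization."""
    res1 = cosine_similarity(u, v)   # step 1: value in [-1, 1]
    res2 = (res1 + 1.0) / 2.0        # step 2: assumed linear map to [0, 1]
    return 1.0 - res2                # step 3: distance in [0, 1]


same = normalized_cosine_distance((1.0, 2.0), (2.0, 4.0))
orthogonal = normalized_cosine_distance((1.0, 0.0), (0.0, 1.0))
opposite = normalized_cosine_distance((1.0, 0.0), (-1.0, 0.0))
print('same direction: %.4f' % same)
print('orthogonal:     %.4f' % orthogonal)
print('opposite:       %.4f' % opposite)
```

Under this assumption every word-to-word cost lies in [0, 1], so a transport cost built from these costs with document weights that sum to 1 also stays in [0, 1].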

Example usage

# English text example
docs1 = ['man sitting using tool at a table in his home.',
         'vegetable is being sliced.',
         'a speaker presents some products']
docs2 = ['The president comes to China',
         'someone is slicing a tomato with a knife on a cutting board.',
         'the speaker is introducing the new products on a fair.']
# calculate similarity for each pair (docs1[i], docs2[i])
sims = wmd_sim('en', docs1, docs2)
# calculate SER
sers = compute_ser(sims)
# print result
for i in range(len(sims)):
    print('Similarity: %.4f' % sims[i])
    print('SER: %.4f' % sers[i])
# Chinese text example
docs1 = ['时间太晚不得就算了', 
docs2 = ['此间贷款不得就算啦', 
# calculate similarity
sims = wmd_sim('cn', docs1, docs2)
# calculate SER
sers = compute_ser(sims)
# print result
for i in range(len(sims)):
    print('Similarity: %.4f' % sims[i])
    print('SER: %.4f' % sers[i])

Example results


man sitting using tool at a table in his home.
The president comes to China
Similarity: 0.1213
SER: 0.8787

vegetable is being sliced.
someone is slicing a tomato with a knife on a cutting board.
Similarity: 0.3555
SER: 0.6445

a speaker presents some products
the speaker is introducing the new products on a fair.
Similarity: 0.4823
SER: 0.5177


Similarity: 0.0903
SER: 0.9097

Similarity: 0.1572
SER: 0.8428

Similarity: 0.4049
SER: 0.5951