This repo contains several ways to calculate the similarity between source and target sentences. You can choose which pre-trained model to use, such as ELMo, BERT, or Universal Sentence Encoder (USE).
You can also choose the method used to compute the similarity:
1. Cosine similarity
2. Manhattan distance
3. Euclidean distance
4. Angular distance
5. Inner product
6. TS-SS score
7. Pairwise-cosine similarity
8. Pairwise-cosine similarity + IDF
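The first five measures in the list operate on a single embedding vector per sentence. The following is a minimal NumPy sketch of how they are commonly defined, not the code used in sensim.py. The function names are hypothetical, and the angular distance here follows one common convention (the angle between the vectors divided by pi); the repository may normalize it differently.

```python
import numpy as np


def cosine_similarity(u, v):
    # Cosine of the angle between u and v; 1.0 means same direction.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def manhattan_distance(u, v):
    # Sum of absolute coordinate differences (L1 norm of u - v).
    return float(np.sum(np.abs(u - v)))


def euclidean_distance(u, v):
    # Straight-line distance (L2 norm of u - v).
    return float(np.linalg.norm(u - v))


def angular_distance(u, v):
    # Angle between u and v, scaled to [0, 1] by dividing by pi.
    # Clipping guards against rounding pushing the cosine outside [-1, 1].
    cos = np.clip(cosine_similarity(u, v), -1.0, 1.0)
    return float(np.arccos(cos) / np.pi)


def inner_product(u, v):
    # Unnormalized dot product; sensitive to vector length.
    return float(np.dot(u, v))


# Toy 2-d vectors standing in for sentence embeddings (illustrative only).
source = np.array([1.0, 0.0])
target = np.array([0.0, 2.0])

for name, fn in [
    ("cosine", cosine_similarity),
    ("manhattan", manhattan_distance),
    ("euclidean", euclidean_distance),
    ("angular", angular_distance),
    ("inner", inner_product),
]:
    print(name, fn(source, target))
```

Note that cosine similarity and inner product grow with similarity, while the three distances shrink, so scores from different methods are not directly comparable.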
You can experiment with (The number of models) x (The number of methods) combinations!
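The pairwise-cosine methods (7 and 8) differ from the others in that they compare token embeddings rather than one vector per sentence. The sketch below shows greedy token matching in the style of BERTScore, which is one of the listed dependencies. It is an illustration under assumptions, not the repository's implementation: the function name, the choice of the source sentence as the reference side, and the toy IDF weights are all hypothetical.

```python
import numpy as np


def pairwise_cosine_f1(src_tokens, tgt_tokens, src_idf=None, tgt_idf=None):
    """Greedy-matching F1 over token embeddings.

    src_tokens: array of shape (n, d), one row per source token.
    tgt_tokens: array of shape (m, d), one row per target token.
    src_idf / tgt_idf: optional per-token weights; uniform if omitted.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    src = src_tokens / np.linalg.norm(src_tokens, axis=1, keepdims=True)
    tgt = tgt_tokens / np.linalg.norm(tgt_tokens, axis=1, keepdims=True)
    sim = src @ tgt.T  # (n, m) matrix of pairwise cosine similarities

    src_w = np.ones(len(src)) if src_idf is None else np.asarray(src_idf, dtype=float)
    tgt_w = np.ones(len(tgt)) if tgt_idf is None else np.asarray(tgt_idf, dtype=float)

    # Each source token is matched to its most similar target token (recall),
    # and each target token to its most similar source token (precision).
    recall = np.sum(src_w * sim.max(axis=1)) / np.sum(src_w)
    precision = np.sum(tgt_w * sim.max(axis=0)) / np.sum(tgt_w)

    if precision + recall <= 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))


# Toy token embeddings (illustrative only): two source tokens, one target token.
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[1.0, 0.0]])

print(pairwise_cosine_f1(src, tgt))                     # uniform weights
print(pairwise_cosine_f1(src, tgt, src_idf=[3.0, 1.0])) # rarer first token weighted up
```

With IDF weighting, rare tokens contribute more to the score than frequent ones, so matching a content word matters more than matching a function word.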
- After cloning this repository, you can simply install all the dependent libraries described in requirements.txt with pip install -r requirements.txt.
git clone https://github.com/Huffon/sentence-similarity.git
cd sentence-similarity
pip install -r requirements.txt
- To test your sentences, you should fill out corpus.txt with sentences as below.
I ate an apple.
I went to the Apple.
I ate an orange.
...
- Then, choose the model and method to be used to calculate the similarity between source and target sentences.
python sensim.py
--model MODEL_NAME
--method METHOD_NAME
--verbose LOG_OPTION (bool)
- In the following section, you can see the result of sentence-similarity.
- As you know, there is no silver bullet that calculates perfect similarity between sentences. You should conduct various experiments with your own dataset.
- Caution: TS-SS score might not suit short-sentence similarity tasks, since this method was originally devised to calculate the similarity between documents.
- Result:
- Python version should be higher than 3.6.x
- You should install PyTorch via the official installation guide
- To use the spaCy model, which is used to tokenize input sentences, download the English model by running python -m spacy download en_core_web_sm.
allennlp==0.9.0
bert-score==0.2.1
numpy==1.17.3
scikit-learn==0.21.3
scipy==1.3.1
seaborn==0.9.0
sentence-transformers==0.2.3
spacy==2.1.9
tensorflow==1.15.0
tensorflow-hub==0.7.0
torch==1.3.0
- Upgrade TF to TF2.0 to use USE 3
- Add pairwise cosine similarity method in use_elmo.
- Add InferSent, Sent2Vec, plain GloVe as models.
- Universal Sentence Encoder
- Deep contextualized word representations
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering