paraphrase-identification

(Baseline model) Bag of Words to featurize the texts and cosine similarity to score each pair. It gives a good result on the test set (0.8 F1-score), but it might not work well on simple sentences (e.g. "I am good" vs. "I am not good"): Bag of Words ignores word order and context, so a single word such as "not" barely changes the vector. This can be mitigated to some extent with the ngram_range parameter of CountVectorizer (e.g. ngram_range=(1, 3)).
Sentence embeddings from a MiniLM model, compared with cosine similarity. MiniLM offers a good tradeoff between speed and performance.
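
A minimal sketch of the two approaches above, to make the n-gram effect concrete. It is not the code from paraphrase_identification.py: the function names are hypothetical, the vectorizer is fitted on the two sentences only (the real script may fit it on the whole dataset), and the checkpoint name "all-MiniLM-L6-v2" is an assumption, since the exact MiniLM model is not stated here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def bow_similarity(text_a, text_b, ngram_range=(1, 1)):
    """Cosine similarity between the Bag of Words count vectors of two texts."""
    # Note: CountVectorizer's default tokenizer drops one-character tokens such as "I".
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    counts = vectorizer.fit_transform([text_a, text_b])  # sparse matrix, one row per text
    return float(cosine_similarity(counts[0], counts[1])[0, 0])


def minilm_similarity(text_a, text_b, model_name="all-MiniLM-L6-v2"):
    """Cosine similarity between sentence embeddings (model name is an assumption)."""
    # Imported lazily so the baseline above runs without sentence-transformers installed.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer(model_name)
    embeddings = model.encode([text_a, text_b], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]).item())


# Unigrams only: the pair differs by the single token "not".
unigram_score = bow_similarity("I am good", "I am not good")
# With n-grams up to length 3, "am not", "not good" and "am not good" also count.
ngram_score = bow_similarity("I am good", "I am not good", ngram_range=(1, 3))

print(f"unigram similarity: {unigram_score:.3f}")
print(f"1-3 gram similarity: {ngram_score:.3f}")
```

With n-grams the negated sentence gains features the other sentence lacks, so the score drops compared to the unigram case. Turning either score into a paraphrase / not-paraphrase decision still needs a threshold, which would have to be tuned on the dataset.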

Another approach would be to train a classifier on features extracted from the texts. This might not work well because the dataset is small (~3k examples) while the number of features can be very high.
To augment the dataset, new datapoints can be generated for non-matching texts.
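
One simple way to generate non-matching datapoints is to cross sentences from different original pairs. The sketch below is only an illustration under that assumption: the function name, the (sentence_1, sentence_2) pair layout and the label 0 for "not a paraphrase" are hypothetical, and the toy sentences are made up. Two sentences drawn from different pairs can occasionally still be paraphrases, so the generated labels are noisy.

```python
import random


def make_negative_pairs(pairs, n_new, seed=0):
    """Build n_new non-matching examples by crossing sentences from different pairs.

    pairs: list of (sentence_1, sentence_2) tuples from the original dataset.
    Returns a list of (sentence_1, sentence_2, 0) tuples, where 0 means "not a paraphrase".
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to cross")
    rng = random.Random(seed)  # seeded for reproducibility
    negatives = []
    for _ in range(n_new):
        i, j = rng.sample(range(len(pairs)), 2)  # two distinct pair indices
        negatives.append((pairs[i][0], pairs[j][1], 0))
    return negatives


# Toy data for illustration only.
pairs = [
    ("The movie was great.", "I really enjoyed the film."),
    ("He arrived late to work.", "He got to the office after time."),
    ("Prices rose last month.", "Costs went up over the past month."),
]
augmented = make_negative_pairs(pairs, n_new=4)
for sentence_1, sentence_2, label in augmented:
    print(label, "|", sentence_1, "|", sentence_2)
```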

paraphrase_identification.py - Builds the model on the dataset and calculates the F1-score on the test set.
app.py - A Flask web application to provide a simple interface for paraphrase identification.

Steps to run:

First, clone the repository or download it as a zip. Open a terminal/cmd and change into the paraphrase_identification directory.

To calculate the F1-score on the test dataset, run python paraphrase_identification.py
To run the app with Docker, follow these steps: