Semantic Similarity Ranking

A simple implementation of ranking for search-based systems using semantic similarity.

Dataset

https://ciir.cs.umass.edu/downloads/WebAP/

Writeup

Note: A more detailed writeup will be added soon

Acquired the dataset through Slack.
Pre-processed the dataset:
1. Removed stop-words
2. Lemmatized the corpus and saved it for future reference
3. Created an inverted index (for demonstration purposes)
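
The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the stop-word list, the `preprocess` and `build_inverted_index` helpers, and the sample documents are all hypothetical, and lemmatization is left out because it needs an external library such as NLTK or spaCy.

```python
from collections import defaultdict

# Hypothetical, tiny stop-word list for illustration only;
# a real pipeline would use a full list from an NLP library.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "to"}


def preprocess(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop-words."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in preprocess(text):
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Made-up documents, not taken from the WebAP dataset.
docs = {
    "d1": "The tumor is a sign of cancer.",
    "d2": "Ovarian cancer screening in the clinic.",
    "d3": "A cloud of dust.",
}
index = build_inverted_index(docs)
print(index["cancer"])  # ['d1', 'd2']
```

Looking up a query term in `index` returns the candidate documents directly, which is the usual reason to build an inverted index before any ranking step.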
Converted corpus to vectors using Word2Vec

Tested semantic similarity on random query words using the trained Word2Vec model.

Most similar word examples to the query

modelW2V.wv.similarity('cancer', 'tumor') 
#0.8035345

modelW2V.wv.similarity('cancer','ovarian')
#0.860453

Least similar word examples to the query

modelW2V.wv.similarity('cancer', 'cloud') 
#0.8035345
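
For context, `wv.similarity` in gensim returns the cosine similarity between the two word vectors. The sketch below shows that computation in plain Python. The helper name `cosine_similarity` and the 3-dimensional vectors are made up for illustration; real Word2Vec vectors typically have a hundred or more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Made-up vectors: vec_b points in the same direction as vec_a,
# vec_c is orthogonal to both.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]
vec_c = [0.0, 0.0, 3.0]

sim_same = cosine_similarity(vec_a, vec_b)        # close to 1.0
sim_orthogonal = cosine_similarity(vec_a, vec_c)  # 0.0
print(sim_same, sim_orthogonal)
```

Scores near 1 mean the vectors point in nearly the same direction, and scores near 0 mean they are unrelated, so a semantically distant pair would be expected to score well below a closely related one.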

Converted corpus to vectors using Doc2Vec

Found the most similar documents given a query

new_sentence = "i love dogs".split(" ") 
# *query = {i,love,dogs}*

model.docvecs.most_similar(positive=[model.infer_vector(new_sentence)], topn=5)
# *selecting the top n documents (in gensim 4.x, `model.docvecs` is renamed to `model.dv`)*

#Result
#[('5235', 0.7422172427177429),
#('4870', 0.7328481674194336),
#('95', 0.7185875773429871),
#('5868', 0.7118589878082275),
#('1954', 0.6987151503562927)]

# *Format = ('DocID', cosine similarity between the document and the query)*
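
The ranking step can be illustrated without gensim: score every document vector against the query vector by cosine similarity, sort in descending order, and keep the top n. This is a simplified sketch under stated assumptions. The `rank_documents` helper and the 2-dimensional vectors are hypothetical, and the document IDs are reused from the result above only as labels.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs, topn=5):
    """Return the topn (doc_id, similarity) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up 2-dimensional document vectors; in the project these would be
# the Doc2Vec vectors, and query_vec would come from infer_vector.
doc_vecs = {
    "95": [1.0, 0.0],
    "1954": [0.0, 1.0],
    "5235": [1.0, 1.0],
}
query_vec = [2.0, 1.0]

ranking = rank_documents(query_vec, doc_vecs, topn=2)
for doc_id, score in ranking:
    print(doc_id, round(score, 4))
```

A linear scan like this is fine for a few thousand documents; larger corpora usually call for an approximate nearest-neighbour index instead.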

Cheers!
