
NLP: Word Sense Disambiguation (WSD) 📚 in Python 3 🐍.


Word Sense Disambiguation

Anish Sachdeva (DTU/2K16/MC/13)

Natural Language Processing (Dr. Seba Susan)

📘 Path Length Similarity | 📘 Resnik Similarity | 📗 Naïve Disambiguation | 📗 Simple LESK Algorithm | ✒ Report


Overview

Introduction

We explore four different metrics to compare word similarity and disambiguate word senses. For the four methods, refer to the notebooks below:

Notebooks

  1. Naive Disambiguation
  2. Simple LESK Algorithm Disambiguation
  3. Path Length Similarity Metric
  4. Resnik Similarity Metric

Naïve Disambiguation

To see the disambiguation of any given word using the naive method, clone this repository onto your machine and install all dependencies:

git clone https://github.com/anishLearnsToCode/word-sense-disambiguation.git
pip install -r requirements.txt

Navigate to the naive_method.py file, run it, and enter a word of your choice:

cd src
python naive_method.py
>> Enter word for disambiguation:    bank
>> Definition: a large natural stream of water (larger than a creek)
>> Examples:
>> ['they pulled the canoe up on the bank',
>> 'he sat on the bank of the river and watched the currents']
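Under the hood, the naive baseline amounts to always picking the most frequent sense, i.e. the first entry in the sense inventory (WordNet lists a word's synsets from most to least frequent). A minimal self-contained sketch, with a toy hand-built inventory standing in for WordNet and hypothetical names throughout:

```python
# Toy sense inventory: each word maps to its senses ordered by frequency,
# mimicking how WordNet orders synsets from most to least common.
SENSES = {
    "bank": [
        {"definition": "a large natural stream of water (larger than a creek)",
         "examples": ["they pulled the canoe up on the bank",
                      "he sat on the bank of the river and watched the currents"]},
        {"definition": "a financial institution that accepts deposits",
         "examples": ["he cashed a check at the bank"]},
    ],
}

def naive_disambiguate(word):
    """Return the first (most frequent) sense, or None if the word is unknown."""
    senses = SENSES.get(word.lower())
    return senses[0] if senses else None

sense = naive_disambiguate("bank")
print(sense["definition"])
print(sense["examples"])
```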

See a running example with an explanation in this notebook.

Simple LESK Similarity Disambiguation

In the Simple LESK algorithm we use the words present in the gloss (context) surrounding the target token to disambiguate its meaning: we assign Inverse Document Frequency (IDF) values to the overlapping words and use them to weight all possible senses of the given token.
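One way this IDF weighting can work is sketched below in self-contained form: each sense's gloss (plus examples) is treated as a "document", and a sense is scored by summing the IDF weights of the context words that overlap with it. The stop-word list, the made-up example sentence appended to the coffee gloss, and all helper names are assumptions for illustration, not the repository's exact code:

```python
import math

STOPWORDS = {"i", "a", "an", "the", "of", "in", "on", "to", "and", "like"}

def simple_lesk_idf(context_words, sense_docs):
    """Score each candidate sense by summing the IDF weights of the
    content words shared between the context and that sense's document."""
    docs = [set(d.lower().split()) - STOPWORDS for d in sense_docs]
    n = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for w in set(context_words) - STOPWORDS:
            if w in doc:
                # document frequency: how many sense docs contain w (>= 1 here)
                df = sum(1 for d in docs if w in d)
                score += math.log(n / df)  # IDF weight of the overlap word
        scores.append(score)
    return scores

# Sense "documents" = gloss plus a made-up example sentence for the coffee sense.
sense_docs = [
    "an island in Indonesia to the south of Borneo",
    "a beverage consisting of an infusion of ground coffee beans ; he ordered a cup of java",
    "a platform-independent object-oriented programming language",
]
context = "i like a hot cup of java in the morning".split()
context.remove("java")  # exclude the target token itself
scores = simple_lesk_idf(context, sense_docs)
best = sense_docs[scores.index(max(scores))]
```

With this toy data only the coffee sense shares a content word ("cup") with the context, so it receives the only non-zero weight and wins.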

To run locally, clone the repository and install dependencies:

git clone https://github.com/anishLearnsToCode/word-sense-disambiguation.git
pip install -r requirements.txt

Navigate to the simple_lesk_algorithm.py file and test it with a sample gloss and word token:

cd src
python simple_lesk_algorithm.py
>> Enter the Gloss (document):	i like a hot cup of java in the morning 
>> Enter word for disambiguation:	java
>> The disambiguated meaning is: a beverage consisting of an infusion of ground coffee beans
>> The weight vector is: [0, 0.28768207245178085, 0]

Path length Similarity Disambiguation

Path Length Similarity computes the minimum-hop path between any two words in the WordNet corpus using the available hypernym paths, and then computes the similarity score as -log(pathlen(w1, w2)).
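This score can be sketched over a toy slice of the hypernym hierarchy: breadth-first search finds the minimum hop count, and the similarity is its negative log. With a 2-hop dog → canine → wolf path this reproduces the -log(2) ≈ -0.6931 value in the session below. The graph and helper names are illustrative, not the repository's code:

```python
import math
from collections import deque

# Toy undirected hypernym graph (a tiny hand-made slice of WordNet's nouns).
EDGES = {
    "dog": ["canine"],
    "wolf": ["canine"],
    "canine": ["dog", "wolf", "carnivore"],
    "carnivore": ["canine", "mammal"],
    "mammal": ["carnivore"],
}

def path_length(a, b):
    """Minimum number of hops between two nodes, via BFS."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in EDGES.get(node, []):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path

def path_similarity(a, b):
    """Similarity score: -log(pathlen(w1, w2))."""
    length = path_length(a, b)
    return -math.log(length) if length else None

print(path_similarity("dog", "wolf"))  # -log(2) ≈ -0.6931471805599453
```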

To compute the path score and the closest synsets between any two English words, run the path_length_similarity.py file:

git clone https://github.com/anishLearnsToCode/word-sense-disambiguation.git
cd word-sense-disambiguation
pip install -r requirements.txt
cd src
python path_length_similarity.py
>> Enter first word:	dog
>> Enter second word:	wolf
>> Dog Definition: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
>> Wolf Definition: any of various predatory carnivorous canine mammals of North America and Eurasia that usually hunt in packs
>> similarity: -0.6931471805599453

Then, to compute the similarity between the 6th document and the other documents (keywords) in the resume, run the path_similarity_resume.py file:

cd src
python path_similarity_resume.py

See the results here, and see the explanation and results in the Jupyter notebook.

Resnik Similarity Disambiguation

To compute the similarity between two words and find the closest possible synsets, run resnik_similarity.py:

git clone https://github.com/anishLearnsToCode/word-sense-disambiguation.git
cd word-sense-disambiguation
pip install -r requirements.txt
cd src
python resnik_similarity.py
>> Enter the first word:	java
>> Enter the second word:	language
>> Java Definition: a platform-independent object-oriented programming language
>> Language Definition: a systematic means of communicating by the use of sounds or conventional symbols
>> similarity: 5.792086967391197
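Resnik similarity scores a pair of senses by the information content (IC) of their lowest common subsumer in the taxonomy, where IC(c) = -log P(c) and P(c) is estimated from corpus counts of c and everything it subsumes. A self-contained sketch over a toy taxonomy; all names, counts, and helper functions here are illustrative, not the repository's code:

```python
import math

# Toy taxonomy: child -> parent (hypernym), with raw corpus counts per concept.
PARENT = {"java_coffee": "beverage", "tea": "beverage", "beverage": "entity",
          "java_language": "language", "language": "entity"}
COUNTS = {"java_coffee": 5, "tea": 5, "beverage": 0,
          "java_language": 4, "language": 6, "entity": 0}

def subsumed_count(concept):
    """Frequency of a concept = its own count plus all descendants' counts."""
    return COUNTS[concept] + sum(subsumed_count(child)
                                 for child, parent in PARENT.items()
                                 if parent == concept)

TOTAL = subsumed_count("entity")  # the root subsumes everything

def ic(concept):
    """Information content: -log P(concept)."""
    return -math.log(subsumed_count(concept) / TOTAL)

def ancestors(concept):
    """The concept itself plus its chain of hypernyms up to the root."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def resnik(a, b):
    """IC of the lowest common subsumer (the most informative shared ancestor)."""
    common = set(ancestors(a)) & set(ancestors(b))
    return max(ic(c) for c in common)

print(resnik("java_language", "language"))
```

Note how unrelated senses (e.g. the coffee and programming-language senses of "java") meet only at the root, whose IC is 0, so their Resnik score is 0.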

To run this metric on the resume and see the similarity between the 6th document and the other documents, run the resnik_similarity_resume.py file.

See the Resnik similarity coefficient matrix here, and see the explanation and results in this notebook.

Bibliography

  1. Speech & Language Processing by Daniel Jurafsky & James H. Martin
  2. nltk
  3. pickle
  4. pandas
  5. pandas.DataFrames
  6. Indexing and Slicing on Pandas DataFrames
  7. numpy
  8. wordnet interface