
Word sense disambiguation using neural models, replicating https://research.google.com/pubs/pub45729.html


Neural WSD

This project aims to replicate Google's "Semi-supervised Word Sense Disambiguation with Neural Models", covering only the LSTM language model component with unsupervised training.

The dictionary is built from the Google English One Million 1-grams, with each word ranked by its global frequency.
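The snippet below is a minimal sketch of how such a frequency-ranked dictionary could be built from 1-gram counts; the tab-separated column layout it assumes (word, year, match count, ...) is an assumption about the 1-grams files, not the project's actual parsing code.

from collections import Counter

def build_dictionary(onegram_path):
    # Assumed tab-separated layout per line: word, year, match_count, ...
    counts = Counter()
    with open(onegram_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            counts[fields[0]] += int(fields[2])      # sum counts over all years
    # Map each word to an index, most frequent words first
    return {word: idx for idx, (word, _) in enumerate(counts.most_common())}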

This project is for educational purposes only.

How to use

Setup:

  • Python 3.6.3 (Anaconda custom 64-bit)
  • PyTorch 0.3.1 (0.4.0 might not work due to the torch.Tensor and autograd.Variable changes; see the sketch after this list)
  • CUDA 8
  • spaCy v2.0 with English models
  • a folder named batches must exist in the same directory as the train.py file
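Regarding the PyTorch version, the sketch below shows the 0.3.x-style autograd.Variable wrapping that code targeting 0.3.1 typically relies on (the exact usage inside train.py is an assumption); in 0.4.0, Variable was merged into torch.Tensor, which is why newer versions may break.

import torch
from torch.autograd import Variable

batch = torch.LongTensor([[1, 2, 3], [4, 5, 6]])
inputs = Variable(batch)            # explicit wrapper needed in PyTorch 0.3.x
if torch.cuda.is_available():       # matches the CUDA 8 requirement above
    inputs = inputs.cuda()
print(inputs.data.size())           # .data accesses the underlying tensor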

Training

Start training by using this command:

python train.py <path/to/training_set> <path/to/model>

where:

  • the training set file is a UTF-8 encoded .txt file;
  • the model file is a pre-existing .pt file (by default: word_guesser.pt).

The model file is optional: if it is not specified, the script assumes no model exists and creates a file named word_guesser.pt, overwriting it if it already exists. If a model file is specified, training resumes from that model (for example, to continue a previous training run).
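A hypothetical sketch of this load-or-create behaviour is shown below; the argument handling and the placeholder model are assumptions for illustration, not the project's actual train.py code.

import os
import sys
import torch
import torch.nn as nn

DEFAULT_MODEL = "word_guesser.pt"

def load_or_create_model(argv):
    # argv: [script, path/to/training_set, optional path/to/model]
    model_path = argv[2] if len(argv) > 2 else DEFAULT_MODEL
    if len(argv) > 2 and os.path.exists(model_path):
        model = torch.load(model_path)   # resume training from the given .pt file
    else:
        # Placeholder architecture; the real project defines its own LSTM model
        model = nn.LSTM(input_size=512, hidden_size=512)
    return model, model_path

model, model_path = load_or_create_model(sys.argv)
# ... training loop ...
torch.save(model, model_path)            # overwrites the file if it already exists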

Testing

Start querying the model by using this command:

python query.py <path/to/test_set> <path/to/model>

where:

  • the test set file is a UTF-8 encoded .txt file;
  • the model file is the same .pt file used for training.

The model file is optional: if it is not specified, the script assumes the model is stored in word_guesser.pt; if a model file is specified, the model stored in that file is used for predictions.
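The sketch below illustrates what the query side could look like, loading the model and splitting the UTF-8 test set into sentences with spaCy; the file handling and model interface here are assumptions, not the actual query.py code.

import sys
import torch
import spacy

model_path = sys.argv[2] if len(sys.argv) > 2 else "word_guesser.pt"
model = torch.load(model_path)
model.eval()                             # inference mode, no gradient updates

nlp = spacy.load("en")                   # spaCy v2.0 English model
with open(sys.argv[1], encoding="utf-8") as f:
    doc = nlp(f.read())

for sentence in doc.sents:
    tokens = [token.text for token in sentence]
    # feed `tokens` to the model here; the prediction interface is project-specific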

Features

  • Multi-threaded operation: reading the training corpus, splitting it into sentences, creating batches, and training the model happen simultaneously (producer-consumer pattern)
  • Low RAM usage thanks to sized queues between threads and periodic dumps of created batches
  • Sentences are never padded; instead, they are grouped by length and batches are built from sentences of the same length
  • Dynamic batch size: batches are filled up to the maximum size (hyper-parameter batch_dim) whenever possible, and batches smaller than that size are not padded (see the sketch after this list)
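The sketch below illustrates the batching idea from the list above (length buckets, a sized queue between producer and consumer, no padding); the values of batch_dim and the queue size are assumptions, not the project's actual parameters.

from collections import defaultdict
import queue
import threading

batch_dim = 32                          # maximum batch size (value assumed here)
batch_queue = queue.Queue(maxsize=64)   # sized queue keeps memory usage bounded

def producer(sentences):
    buckets = defaultdict(list)         # sentence length -> pending sentences
    for sent in sentences:
        bucket = buckets[len(sent)]
        bucket.append(sent)
        if len(bucket) == batch_dim:    # full batch of equal-length sentences
            batch_queue.put(bucket.copy())
            bucket.clear()
    for bucket in buckets.values():     # leftover batches are smaller but never padded
        if bucket:
            batch_queue.put(bucket)
    batch_queue.put(None)               # sentinel: no more batches

def consumer(train_step):
    while True:
        batch = batch_queue.get()
        if batch is None:
            break
        train_step(batch)               # every sentence in `batch` has the same length

sentences = [["the", "cat", "sat"], ["a", "dog", "ran"], ["hello", "world"]]
threading.Thread(target=producer, args=(sentences,)).start()
consumer(lambda batch: print(len(batch), "sentences of length", len(batch[0])))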

Known bugs/problems

  • The batches folder is not created automatically if it is missing (see the workaround after this list)
  • The only accepted training-corpus format is UTF-8 encoded plain text
  • Training is slow on large corpora; it could be sped up by implementing hierarchical softmax or negative sampling
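Until the first issue is fixed in the code, a simple workaround is to create the folder before training; the snippet assumes it is run from the project folder, where batches must live next to train.py.

import os

# Run from the project folder: creates `batches` next to train.py if missing
os.makedirs("batches", exist_ok=True)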

Consulted resources