Hypernym discovery

A hypernym discovery system that learns to predict is-a relationships between words using projection learning (see http://aclweb.org/anthology/S18-1116): the model learns projections of word embeddings that map a query (hyponym) towards the embeddings of its hypernyms.

The data of SemEval-2018 Task 9 is used for training and testing.

Requirements

  • Python 3 (tested using version 3.6.9)
  • PyTorch (tested using version 1.2.0)
  • Pyhocon
  • Joblib
  • Lots of disk space (downloading, unzipping, and preprocessing the corpus requires around 40 GB for sub-task 1A)
  • Bash (to get the data and corpora, and to install word2vec)
  • A C compiler (if you install word2vec)

Usage

Make a directory where we can store a lot of data:

mkdir [dir-data]
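
The examples below walk through sub-task 1A with a data directory named data (illustrative choices; substitute your own):

mkdir data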

Get the training and evaluation data from the website of SemEval-2018 Task 9 (this also copies the scoring script into the current directory):

./get_data.sh [dir-data]
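
For example, with the illustrative data directory from above:

./get_data.sh data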

Get corpus from the website of SemEval-2018 Task 9:

./get_corpus.sh [subtask dir-data]
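
For example, to get the corpus for sub-task 1A:

./get_corpus.sh 1A data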

Make preprocessed corpus and vocab:

python prep_corpus.py [subtask path-corpus dir-datasets path-output]
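
For example, assuming get_corpus.sh left the raw corpus at data/1A_corpus.txt (the actual filename is whatever was downloaded) and the SemEval datasets are under data:

python prep_corpus.py 1A data/1A_corpus.txt data data/1A_corpus_prep.txt

This should also produce the vocab file data/1A_corpus_prep.txt.vocab, which the word2vec command below reads.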

Install word2vec in current directory:

./install_word2vec.sh

Train word embeddings on the corpus using word2vec. Make sure to use the corpus and vocab that were produced by prep_corpus.py (not the raw corpus):

word2vec/trunk/word2vec -train [path-preprocessed-corpus] -read-vocab [path-preprocessed-corpus].vocab -output [path-output] -cbow 0 -negative 10 -size 200 -window 7 -sample 1e-5 -min-count 1 -iter 10 -threads 8 -binary 0 

Preprocess data and write in a pickle file:

python prep_data.py [subtask dir-datasets path-embeddings path-output]
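
For example, assuming word2vec wrote the embeddings to data/1A_embeddings.txt (illustrative path):

python prep_data.py 1A data data/1A_embeddings.txt data/1A_data.pkl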

Review hyperparameter settings in hparams.conf.

Train the model on the training and dev data in the pickle file; this writes a model and a log file in dir-model:

python train.py [path-pickle path-hparams dir-model]
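
For example, with the pickle file from the previous step and an illustrative model directory:

python train.py data/1A_data.pkl hparams.conf models/1A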

Load the trained model and make predictions on the test queries:

python predict.py [path-model path-pickle path-output]
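
For example (the model filename here is hypothetical; use whichever file train.py wrote to dir-model):

python predict.py models/1A/model.pt data/1A_data.pkl data/1A_pred.txt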

Evaluate the predictions on the test set using the scoring script of SemEval-2018 Task 9:

python2.7 path/to/SemEval-Task9/task9-scorer.py path/to/SemEval-Task9/test/gold/<subtask>.<language>.test.gold.txt path/to/output/pred.txt
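
For example, for sub-task 1A in English, with illustrative paths for the gold file and the predictions from the previous step:

python2.7 SemEval-Task9/task9-scorer.py SemEval-Task9/test/gold/1A.english.test.gold.txt data/1A_pred.txt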