Hypernym discovery
A hypernym discovery system which learns to predict is-a relationships between words using projection learning (see http://aclweb.org/anthology/S18-1116).
The data of SemEval-2018 Task 9 is used for training and testing.
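As a rough illustration of the general idea behind projection learning, the sketch below maps a query embedding through a matrix and ranks candidate hypernyms by how well they match the projected vector. Everything in it is an assumption made for illustration: the function names, the toy 2-dimensional vectors, and the hand-picked (not learned) projection matrix are hypothetical. The model trained by this repository is defined by its own code and hyperparameters and may differ substantially.

```python
import math


def project(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    """Logistic function, squashing a score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(query_vec, projection, candidates):
    """Project the query, then score each candidate against the projection.

    `candidates` maps a word to its embedding. Returns (word, score)
    pairs sorted from most to least likely hypernym.
    """
    projected = project(projection, query_vec)
    scores = {word: sigmoid(dot(projected, vec)) for word, vec in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): in a real system the projection is learned
# from (query, hypernym) training pairs rather than written by hand.
toy_projection = [[0.0, 1.0], [1.0, 0.0]]
toy_query = [1.0, 0.0]
toy_candidates = {"animal": [0.0, 1.0], "car": [1.0, 0.0]}
ranked = rank_candidates(toy_query, toy_projection, toy_candidates)
```

With this toy setup the projected query lines up with the vector for "animal", so that candidate is ranked first.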
Requirements
- Python 3 (tested using version 3.6.9)
- PyTorch (tested using version 1.2.0)
- Pyhocon
- Joblib
- Lots of disk space (downloading, unzipping, and preprocessing the corpus requires around 40 GB for sub-task 1A)
- Bash (to get the data and corpora, and to install word2vec)
- C compiler (if you install word2vec)
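Since the corpus steps need a lot of disk space, it may help to check free space before starting. The following is a minimal sketch using only the Python standard library; the helper name `has_enough_space` and the 40 GiB threshold (a rough stand-in for the "around 40 GB" figure above) are assumptions for illustration, not part of this repository.

```python
import shutil

GIB = 1024 ** 3  # bytes per gibibyte


def has_enough_space(path, required_gib):
    """Return True if the filesystem holding `path` has at least
    `required_gib` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gib * GIB


if __name__ == "__main__":
    data_dir = "."  # hypothetical: replace with your dir-data
    if not has_enough_space(data_dir, 40):
        print("Warning: less than 40 GiB free in", data_dir)
```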
Usage
Make directory where we can store a lot of data:
mkdir [dir-data]
Get training and evaluation data from the website of SemEval-2018 Task 9 (also copies the scoring script to the current directory):
./get_data.sh [dir-data]
Get corpus from the website of SemEval-2018 Task 9:
./get_corpus.sh [subtask dir-data]
Make preprocessed corpus and vocab:
python prep_corpus.py [subtask path-corpus dir-datasets path-output]
Install word2vec in current directory:
./install_word2vec.sh
Train word embeddings on corpus using word2vec. Make sure to use the corpus and vocab that were produced by prep_corpus.py (not the raw corpus):
word2vec/trunk/word2vec -train [path-preprocessed-corpus] -read-vocab [path-preprocessed-corpus].vocab -output [path-output] -cbow 0 -negative 10 -size 200 -window 7 -sample 1e-5 -min-count 1 -iter 10 -threads 8 -binary 0
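With `-binary 0`, word2vec writes the vectors as plain text: a header line giving the vocabulary size and dimension, then one line per word with its values separated by spaces. To sanity-check the output before moving on, a small loader along the following lines can be used. This is only a sketch; the function name `load_text_embeddings` is hypothetical and is not how prep_data.py necessarily reads the file.

```python
def load_text_embeddings(path):
    """Load word2vec vectors saved in text format (-binary 0).

    Returns (vectors, dim), where `vectors` maps each word to a list
    of floats. Lines whose length does not match the header dimension
    are skipped.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        header = f.readline().split()
        dim = int(header[1])  # header is "<vocab_size> <dim>"
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # malformed or empty line
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


# Usage (path is a placeholder for the -output file given above):
# vectors, dim = load_text_embeddings("[path-output]")
# print(len(vectors), "words of dimension", dim)
```

With the settings above (`-size 200`), the reported dimension should be 200.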
Preprocess data and write in a pickle file:
python prep_data.py [subtask dir-datasets path-embeddings path-output]
Review hyperparameter settings in hparams.conf.
Train model on training and dev data in pickle file, write a model and a log file in dir-model:
python train.py [path-pickle path-hparams dir-model]
Load trained model, make predictions on test queries:
python predict.py [path-model path-pickle path-output]
Evaluate predictions on the test set using scoring script of SemEval-2018 Task 9:
python2.7 path/to/SemEval-Task9/task9-scorer.py path/to/SemEval-Task9/test/gold/<subtask>.<language>.test.gold.txt path/to/output/pred.txt
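For a quick check of a prediction file without the official scorer, a simplified mean reciprocal rank can be computed as sketched below. This assumes that the gold and prediction files both contain one line per query with tab-separated hypernyms, which should be verified against the task data. The function names are hypothetical, and the official scoring script remains the reference for reported results.

```python
def read_tab_separated(path):
    """Read a file with one query per line and tab-separated terms."""
    with open(path, encoding="utf-8") as f:
        return [[t for t in line.rstrip("\n").split("\t") if t] for line in f]


def mean_reciprocal_rank(gold, pred):
    """Average, over queries, of 1/rank of the first correct prediction.

    `gold` and `pred` are parallel lists of term lists. A query with
    no correct prediction contributes 0.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of queries")
    if not gold:
        return 0.0
    total = 0.0
    for gold_terms, pred_terms in zip(gold, pred):
        gold_set = set(gold_terms)
        for rank, term in enumerate(pred_terms, start=1):
            if term in gold_set:
                total += 1.0 / rank
                break
    return total / len(gold)


# Usage (paths are placeholders):
# gold = read_tab_separated("path/to/gold.txt")
# pred = read_tab_separated("path/to/output/pred.txt")
# print(mean_reciprocal_rank(gold, pred))
```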