Code for our ICLR'19 submission on Poincare GloVe. This repo is a fork of the gensim repository.
To set up the environment for the Poincare GloVe code, follow these steps:
- Install all the dependencies (e.g. `pip3 install Cython nltk annoy`). For the complete list of dependencies, check out the official gensim documentation.
- Clone the repository and install it in development mode:

  ```
  git clone https://github.com/alex-tifrea/poincare_glove.git
  cd poincare_glove
  python3 setup.py develop
  ```
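A quick way to verify the install: since this repo is a fork of gensim, the package keeps the `gensim` name, and after a develop install it should resolve to the cloned checkout.

```python
# Sanity check after `python3 setup.py develop`: the gensim package should
# now resolve to the cloned poincare_glove repository.
import gensim
print(gensim.__file__)
```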
This version has been tested under Python 3.6.
For training and evaluating a model, we use the `run_{word2vec,glove}.sh` scripts. Their usage is similar, so the following will focus only on GloVe.
Training: To train a Vanilla (Euclidean) GloVe model, run the following:
```
./run_glove.sh --train --root path/to/root --coocc_file path/to/coocc/file --vocab_file path/to/vocab/file --epochs 50 --workers 20 --restrict_vocab 200000 --chunksize 1000 --lr 0.05 --bias --size 100
```
The root should be the folder that also contains the repository folder (i.e. the folder in which the `git clone` command was run). The `coocc_file` is a binary file that contains co-occurrence triples in the format generated by the GloVe preprocessing scripts (https://github.com/stanfordnlp/GloVe/blob/master/src/cooccur.c). The `vocab_file` is a text file that contains the vocabulary, in a format similar to the one generated by https://github.com/stanfordnlp/GloVe/blob/master/src/vocab_count.c.
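For reference, here is a minimal Python sketch of how those two files can be read. The `iid` record layout (two C ints plus a double per co-occurrence record) assumes GloVe's default build, so double-check it against your own binaries:

```python
import struct

def read_coocc_triples(path, max_records=10):
    """Yield (word1_id, word2_id, count) triples from a GloVe binary
    co-occurrence file (assuming the default cooccur.c record layout)."""
    record = struct.Struct("iid")  # int, int, double
    with open(path, "rb") as f:
        for _ in range(max_records):
            chunk = f.read(record.size)
            if len(chunk) < record.size:
                return
            yield record.unpack(chunk)

def read_vocab(path):
    """Parse a vocab_count.c-style vocab file: one 'word count' pair per line."""
    with open(path, encoding="utf8") as f:
        return {word: int(count)
                for word, count in (line.split() for line in f if line.strip())}
```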
For Poincare embeddings use something similar to the following:
```
./run_glove.sh --train --root path/to/root --coocc_file path/to/coocc/file --vocab_file path/to/vocab/file --epochs 50 --workers 20 --restrict_vocab 200000 --lr 0.01 --poincare 1 --bias --size 100 --dist_func cosh-dist-sq
```
The parameter `dist_func` specifies which function `h` to use (where `h` is the notation used in the paper). To see what each of the possible `dist_func` options does, consult the code in `glove_inner.pyx` (functions `mix_poincare_similarity` and `poincare_similarity`).
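For intuition, here is a rough NumPy sketch of the single-ball case. It assumes that `cosh-dist-sq` stands for h(d) = cosh²(d) applied to the Poincare distance, and that -h(d(w, c)) plays the role of the dot product in the GloVe objective; the authoritative definitions live in `glove_inner.pyx`:

```python
import numpy as np

def poincare_dist(x, y, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_dist = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / max(denom, eps))

def cosh_dist_sq_similarity(w, c):
    """Hypothetical stand-in for poincare_similarity with
    dist_func=cosh-dist-sq: similarity -h(d) with h = cosh^2."""
    return -np.cosh(poincare_dist(w, c)) ** 2
```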
For a Cartesian product of Poincare balls, the command changes a little bit:
```
./run_glove.sh --train --root path/to/root --coocc_file path/to/coocc/file --vocab_file path/to/vocab/file --epochs 50 --workers 20 --restrict_vocab 200000 --lr 0.05 --poincare 1 --bias --size 100 --mix --num_embs 50 --dist_func cosh-dist-sq
```
Here, `num_embs` specifies into how many low-dimensional embeddings the large vector of length `size` will be split. For example, with `--size 100 --num_embs 50`, each embedding is a Cartesian product of 50 two-dimensional Poincare balls.
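Under the same assumptions as the single-ball sketch above, the product case would simply sum the per-factor values; again, `mix_poincare_similarity` in `glove_inner.pyx` is the reference implementation:

```python
import numpy as np

def mix_similarity(w, c, num_embs=50, eps=1e-9):
    """Hypothetical sketch of the product-of-balls similarity: split each
    vector into num_embs factors and sum -cosh(d_i)^2 over the factors."""
    total = 0.0
    for wi, ci in zip(np.split(w, num_embs), np.split(c, num_embs)):
        sq_dist = np.sum((wi - ci) ** 2)
        denom = (1.0 - np.sum(wi ** 2)) * (1.0 - np.sum(ci ** 2))
        d = np.arccosh(1.0 + 2.0 * sq_dist / max(denom, eps))
        total -= np.cosh(d) ** 2
    return total
```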
A number of training options are available:
- `--init_pretrained` will initialize the embeddings from a pretrained model. The pretrained model to use for a given embedding size and setting needs to be specified in the `glove_main.py` file, by changing the `INITIALIZATION_MODEL_FILENAME` variable.
- `--no_redirect` will not redirect the output to a log file, but will instead print it to `stdout`.
- `--no_eval` will skip the evaluation of the model that normally runs after training is complete.
- `--ckpt_emb` will save snapshots of the embedding values for a number of words during training, so that they can be used to create 2D animations of how the words' positions change. It is usually used for 2D (or 3D) embeddings, which can be visualized.
Evaluating: Evaluation for all model types works in a similar way. The command format is the following:
```
./run_glove.sh --eval --restrict_vocab 200000 --root path/to/root --model_file path/to/saved/model
```
`path/to/saved/model` should point to the location of the model that is being evaluated.
Some of the options that are allowed for evaluation are the following:
- `--agg` will not evaluate the model on the target vectors only; instead, it will combine information from both the target (`w`) and the context (`c`) vectors. For the Euclidean case this means simply adding the two, `w+c`. In the hyperbolic case, we instead take the gyromidpoint between `w` and `c` (see the sketch after this list).
- `--cosine_eval` is particularly useful when evaluating Poincare embeddings. The default is to use the Poincare distance to assess similarity between words, and to select the closest word to the gyro-parallel-transported point when answering an analogy query. This argument switches to the cosine distance instead.
- {`--cosadd`, `--cosmul`, `--hyp_pt`, `--distadd`, etc.} allow for using a different technique for solving analogy queries than the default for each model setup.
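For reference, the (unweighted) gyromidpoint of two points can be written with Möbius operations as m = w ⊕ ((−w ⊕ c) ⊗ ½). The sketch below illustrates that standard construction; it is not a copy of the repo's implementation, which may use a different (e.g. weighted) formulation:

```python
import numpy as np

def mobius_add(a, b):
    """Mobius addition a (+) b in the Poincare ball."""
    ab = np.dot(a, b)
    na2, nb2 = np.dot(a, a), np.dot(b, b)
    num = (1.0 + 2.0 * ab + nb2) * a + (1.0 - na2) * b
    return num / (1.0 + 2.0 * ab + na2 * nb2)

def mobius_scalar(r, x, eps=1e-9):
    """Mobius scalar multiplication r (x) x in the Poincare ball."""
    norm = np.linalg.norm(x)
    if norm < eps:
        return np.zeros_like(x)
    return np.tanh(r * np.arctanh(norm)) * x / norm

def gyromidpoint(w, c):
    """Geodesic midpoint of w and c: w (+) ((-w (+) c) (x) 1/2)."""
    return mobius_add(w, mobius_scalar(0.5, mobius_add(-w, c)))
```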
Logs: During training/evaluation, some important information is saved in logs that can later be used to generate plots or to debug and investigate the characteristics of the trained embedding models.
In the folder `ROOT/logs`, the model will save the progress recorded during training, including information about the epoch loss, the vector norms, and the scores on some analogy and similarity benchmarks.
In `ROOT/eval_logs` you can find a persistent copy of the output generated by the evaluation of a model.
Some of the embedding models presented in the paper are made available here. The embeddings have been trained on a dump of the English Wikipedia containing 1.4 billion tokens. For more details about the training setup, please consult the experiments section of our paper.
If you find this code useful for your research, please cite the following paper:
```
@inproceedings{
  tifrea2018poincare,
  title={Poincare Glove: Hyperbolic Word Embeddings},
  author={Alexandru Tifrea and Gary Becigneul and Octavian-Eugen Ganea},
  booktitle={International Conference on Learning Representations},
  year={2019},
  url={https://openreview.net/forum?id=Ske5r3AqK7},
}
```