- The code uses Python 3.8, PyTorch 1.6.0, and 🤗 Transformers. (Note that PyTorch 1.6.0 requires CUDA 10.2 if you want to decontextualize Transformer-based embeddings on a GPU.)
- Install PyTorch and 🤗 Transformers: first run pip install torch (or conda install pytorch torchvision -c pytorch), and then pip install transformers. You can also install PyTorch and transformers in a single line with pip install transformers[torch].
- Install Python dependencies: pip install -r requirements.txt
- Install Cognival: first run cd cognival-cli, then pip install -r requirements.txt, then python setup.py install.
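After installation, a quick check along the following lines can confirm which versions ended up in the environment. This is only a sketch: the helper names check_environment and cuda_available are hypothetical and not part of this repository, and the package names torch and transformers are assumed to be the installed distribution names.

```python
import importlib.metadata
import importlib.util
import sys


def check_environment(packages=("torch", "transformers")):
    """Return {name: version or None} for Python itself and each listed package."""
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        try:
            report[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            report[name] = None
    return report


def cuda_available():
    """Return True/False if torch is installed, or None if it is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch  # imported lazily so the check also works without torch
    return torch.cuda.is_available()


if __name__ == "__main__":
    for name, version in check_environment().items():
        print(f"{name}: {version or 'not installed'}")
    print(f"CUDA available: {cuda_available()}")
```

If a package is reported as not installed, rerun the corresponding install step above.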
Scripts for pulling and processing datasets can be found in the data/datasets directory. For example, to pull the wikitext-2 dataset, run
bash prepare-wikitext-2.sh
We use Ares embeddings (see get_ares.sh) to determine the sense of a word in its context. Specifically, following the recommendations of the Ares authors, we choose the word sense whose Ares embedding is closest (by cosine distance) to the contextualized embedding produced by BERT. To annotate a corpus with word senses, run the following command
python annotate_corpus.py --corpus_path /path/to/corpus/txt --ares_path /path/to/ares/embedding/txt/file --wordset_path /csv/with/words/to/annotate --out_path /pkl/output/file/name
where wordset_path should point to a word set such as the simlex999 word set. This will produce a pkl file (at out_path) that maps each word to its word senses and, for each word sense, to the sentence indices and string ranges where that sense occurs in the given corpus.
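The sense-selection rule described above (pick the sense whose embedding is closest in cosine terms to the contextual embedding) can be sketched as follows. This is an illustration only, not the code in annotate_corpus.py: the function names, the sense labels, and the 3-d toy vectors are all made up, and it assumes the sense vectors and the contextual vector have the same dimension.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_sense(context_vec, sense_vecs):
    """Return the key in sense_vecs whose vector is most similar to context_vec.

    sense_vecs is a hypothetical dict {sense_label: vector} holding the
    candidate senses of a single word.
    """
    return max(sense_vecs, key=lambda s: cosine_similarity(context_vec, sense_vecs[s]))


# Toy 3-d vectors with made-up sense labels; real vectors are much larger.
candidates = {
    "bank%financial": [0.9, 0.1, 0.0],
    "bank%river": [0.0, 0.2, 0.9],
}
print(closest_sense([0.8, 0.3, 0.1], candidates))  # bank%financial
```

Maximizing cosine similarity is equivalent to minimizing cosine distance, so either formulation selects the same sense.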
Once a mapping has been created, we can use all (or a sample of) the contextual embeddings associated with a word sense to create a new embedding for that word sense.
To create new embeddings, run the following command
python extract_embeddings.py --corpus_path /path/to/corpus/txt --idx_path /path/to/mapping/pkl --out_path /output/dir
This will create a text file with the embeddings at /output/dir. Additional parameters, such as the language model, the number of embeddings to aggregate over, and the pooling function, can be set using flags. Run with the --help flag to see all options.
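To make the aggregation step concrete, here is a minimal sketch of pooling several contextual vectors of one word sense into a single embedding. It is not the implementation in extract_embeddings.py: pool_embeddings, its parameter names, and the choice of mean and max pooling are assumptions made for illustration; the pooling functions the script actually supports are listed by --help.

```python
import random


def pool_embeddings(vectors, pooling="mean", max_samples=None, seed=0):
    """Aggregate a list of equal-length vectors into one vector.

    pooling: "mean" or "max", applied element-wise.
    max_samples: if set, pool over a uniform random sample of this many vectors.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    if max_samples is not None and len(vectors) > max_samples:
        # Sample without replacement; a fixed seed keeps the result reproducible.
        vectors = random.Random(seed).sample(vectors, max_samples)
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    if pooling == "mean":
        return [sum(col) / len(col) for col in columns]
    if pooling == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling: {pooling}")


# Three toy 2-d contextual vectors for the same word sense.
contextual = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(pool_embeddings(contextual))                 # [2.0, 2.0]
print(pool_embeddings(contextual, pooling="max"))  # [3.0, 4.0]
```

Sampling caps the cost for frequent senses, at the price of some variance in the resulting embedding.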
- Add these lines to your ~/.bashrc:
export PYTHONPATH=$PYTHONPATH:path/to/embedding_evaluation/
export EMBEDDING_EVALUATION_DATA_PATH='path/to/embedding_evaluation/data/'
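A small check like the one below can catch a missing or mistyped data path before an evaluation run. It is only a sketch; the helper evaluation_data_dir is hypothetical and not part of embedding_evaluation.

```python
import os


def evaluation_data_dir(var="EMBEDDING_EVALUATION_DATA_PATH"):
    """Return the directory named by the environment variable, or raise a clear error."""
    path = os.environ.get(var)
    if not path:
        raise RuntimeError(f"{var} is not set; add the export line to ~/.bashrc and reload the shell")
    if not os.path.isdir(path):
        raise RuntimeError(f"{var} points to {path!r}, which is not a directory")
    return path
```

Remember that changes to ~/.bashrc only take effect in a new shell or after running source ~/.bashrc.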
The overall approach takes only a few lines. Your embeddings should be in a .txt file that contains each word followed by its embedding values, space separated.
from embedding_evaluation.evaluate import Evaluation
from embedding_evaluation.load_embedding import load_embedding_textfile
# Load embeddings as a dictionary {word: embed} where embed is a 1-d numpy array.
embeddings = load_embedding_textfile("path/to/my/embeddings.txt")
# Load and process evaluation benchmarks
evaluation = Evaluation()
results = evaluation.evaluate(embeddings)
evaluation.save_summary_to_file(results, 'results_summary.json')
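If your embeddings live in memory rather than in a file, something like the following sketch could write them in the space-separated text format described above. The helpers save_embeddings_textfile and read_embeddings_textfile are hypothetical and not part of embedding_evaluation, and the layout of one word per line is an assumption; the reader is included only to show the round trip.

```python
import os
import tempfile


def save_embeddings_textfile(embeddings, path):
    """Write {word: sequence of floats} as one 'word v1 v2 ...' line per word."""
    with open(path, "w", encoding="utf-8") as f:
        for word, vector in embeddings.items():
            values = " ".join(repr(float(x)) for x in vector)
            f.write(f"{word} {values}\n")


def read_embeddings_textfile(path):
    """Parse the same format back into {word: list of floats}."""
    result = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:  # skip blank lines
                result[parts[0]] = [float(x) for x in parts[1:]]
    return result


# Round trip through a temporary file with two toy 3-d embeddings.
toy = {"cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "embeddings.txt")
    save_embeddings_textfile(toy, toy_path)
    restored = read_embeddings_textfile(toy_path)
print(restored == toy)  # True
```

Writing values with repr keeps full float precision, so the file reads back to exactly the same numbers.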