This repository implements a GAN with an auxiliary distance loss to post-specialize word embeddings. Code for the paper:
Edoardo Maria Ponti, Ivan Vulić, Goran Glavaš, Nikola Mrkšić, and Anna Korhonen. 2018. Adversarial Propagation and Zero-Shot Cross-Lingual Transfer of Word Vector Specialization. In Proceedings of EMNLP 2018. [arXiv]
If you use this software for academic research, please cite the paper in question:
@inproceedings{ponti2018adversarial,
title={Adversarial Propagation and Zero-Shot Cross-Lingual Transfer of Word Vector Specialization},
author={Ponti, Edoardo Maria and Vulić, Ivan and Glavaš, Goran and Mrkšić, Nikola and Korhonen, Anna},
booktitle={Proceedings of EMNLP 2018},
year={2018}
}
The repository is organized as follows:

- code: contains the main scripts to train and test AuxGAN for post-specialization
- evaluation: script and databases for intrinsic evaluation
- results: where the output of the post-specialization procedure is saved
- vectors: contains the training data (i.e. original vectors and vectors specialized by Attract-Repel)
- vocab: list of words to be excluded from the training data because they are present in the evaluation databases.
All vectors and models for English are available from Google Drive at the following link.
We provide the vectors for Skip-Gram with Negative Sampling (sgns), GloVe (glove), and fastText (ft). Separate files contain: 1) the original distributional vectors of the entire vocabulary (prefix); 2) the original distributional vectors of the words seen in the constraints (distrib); 3) the vectors specialized by Attract-Repel (ar); and 4) the post-specialized vectors (postspec).
The folder also contains some pre-trained models. These can be applied to new original distributional embeddings (e.g. from other languages), provided that they have been previously aligned with our original distributional spaces. In our experiments, we performed unsupervised alignments with MUSE.
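As an illustration, applying such an alignment amounts to a single matrix multiplication. The sketch below is a minimal example, not part of this repository, assuming the MUSE mapping has been exported as a NumPy matrix; both file names are hypothetical.

```python
# Minimal sketch: apply a learned linear alignment to new embeddings before
# feeding them to a pre-trained post-specialization model. File names are
# hypothetical; MUSE itself stores the mapping as a torch checkpoint.
import numpy as np

W = np.load("muse_mapping.npy")                # (dim, dim) source-to-target map
src_vectors = np.load("new_lang_vectors.npy")  # (vocab_size, dim) new space
aligned = src_vectors @ W.T                    # rows now live in the shared space
```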
Finally, the subfolder xling contains post-specialized fastText embeddings for Italian and German.
To train AuxGAN, run:

```
cd code
python adversarial.py --seen_file ../vectors/SEEN_VECTORS --adjusted_file ../vectors/AR_SPECIALIZED_VECTORS \
    --unseen_file ../vectors/ALL_ORIGINAL_VECTORS --out_dir ../results/EXPERIMENT_NAME
```
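For intuition, the generator is trained to fool a discriminator that separates mapped vectors from genuine Attract-Repel vectors, while the auxiliary distance loss keeps mapped vectors close to their specialized targets. Below is a minimal PyTorch-style sketch, not the repository's actual code; the L2 distance, layer sizes, and loss weight are assumptions.

```python
# Sketch of the AuxGAN generator objective: adversarial term + auxiliary
# distance term. Layer sizes, the L2 distance, and aux_weight are assumptions.
import torch
import torch.nn as nn

gen = nn.Sequential(nn.Linear(300, 512), nn.ReLU(), nn.Linear(512, 300))
disc = nn.Sequential(nn.Linear(300, 512), nn.ReLU(), nn.Linear(512, 1))
bce = nn.BCEWithLogitsLoss()
aux_weight = 1.0  # hypothetical weight balancing the two terms

def generator_loss(x_distrib, x_specialized):
    """x_distrib: original vectors; x_specialized: their AR counterparts."""
    fake = gen(x_distrib)
    # Adversarial term: make the discriminator label mapped vectors as real.
    adv = bce(disc(fake), torch.ones(fake.size(0), 1))
    # Auxiliary distance term: stay close to the specialized target vectors.
    aux = ((fake - x_specialized) ** 2).sum(dim=1).mean()
    return adv + aux_weight * aux
```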
After completing the epochs, the script saves two files in the folder specified as --out_dir: gold_embs.txt and silver_embs.txt. They correspond to two different settings: in the gold setting, AR-specialized vectors are kept whenever available and post-specialized vectors fill in the rest; in the silver setting, only post-specialized vectors are saved. The paper reports the gold setting.
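The difference between the two settings can be sketched as follows, assuming the vectors are held in plain {word: vector} dictionaries (the script's actual implementation may differ):

```python
# Sketch of the gold vs. silver output settings; not the repository's code.
def build_output(ar_vectors, postspec_vectors, setting="gold"):
    if setting == "silver":
        # Silver: post-specialized vectors only.
        return dict(postspec_vectors)
    # Gold: prefer the AR-specialized vector whenever one exists.
    return {w: ar_vectors.get(w, v) for w, v in postspec_vectors.items()}
```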
To apply a pre-trained model to new (aligned) distributional vectors, run:

```
cd code
python export.py --in_file ../vectors/IN_VECTORS --out_file ../vectors/OUT_VECTORS \
    --params ../models/EXPERIMENT_SETTINGS.pkl --model ../models/MAPPING_PARAMETERS.t7
```
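The exported file can then be read like any of the other vector files. Here is a minimal loader, assuming the usual whitespace-separated text format (word v1 ... vd per line):

```python
# Minimal loader for whitespace-separated text embeddings (format assumed).
import numpy as np

def load_vectors(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors
```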
To evaluate with SimLex-999 (or SimVerb-3500), call the evaluation script in the evaluation/ directory:

```
python simlex_evaluator.py simlexorig999.txt ../out_dir/<output_file>
```
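The intrinsic evaluation boils down to a Spearman correlation between cosine similarities and the gold ratings. Below is a minimal sketch (parsing of the dataset file is omitted, and simlex_evaluator.py remains the authoritative version):

```python
# Sketch of SimLex-style evaluation: Spearman correlation between cosine
# similarities and gold similarity ratings. Not the repository's code.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate(pairs, vectors):
    """pairs: list of (word1, word2, gold_score); vectors: {word: np.array}."""
    gold, pred = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            gold.append(score)
            pred.append(cosine(vectors[w1], vectors[w2]))
    return spearmanr(gold, pred).correlation
```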
Our paper reports state-of-the-art scores for both SimLex and SimVerb. In the disjoint setting, the words appearing in these datasets were discarded from the Attract-Repel constraints; in the full setting, all constraints were used.
**Disjoint setting**

| Vectors | glove-cc SL | glove-cc SV | fasttext SL | fasttext SV | sgns-w2 SL | sgns-w2 SV |
|---|---|---|---|---|---|---|
| Distributional | .407 | .280 | .383 | .247 | .414 | .272 |
| Specialized: Attract-Repel | .407 | .280 | .383 | .247 | .414 | .272 |
| Post-Specialized: MLP | .645 | .531 | .503 | .340 | .553 | .430 |
| Post-Specialized: AuxGAN | .652 | .552 | .513 | .394 | .581 | .434 |

**Full setting**

| Vectors | glove-cc SL | glove-cc SV | fasttext SL | fasttext SV | sgns-w2 SL | sgns-w2 SV |
|---|---|---|---|---|---|---|
| Distributional | .407 | .280 | .383 | .247 | .414 | .272 |
| Specialized: Attract-Repel | .781 | .761 | .764 | .744 | .778 | .761 |
| Post-Specialized: MLP | .785 | .764 | .768 | .745 | .781 | .763 |
| Post-Specialized: AuxGAN | .789 | .764 | .766 | .741 | .782 | .762 |
Part of the code is borrowed, with some changes, from the GAN implementation in MUSE; the linked repository contains a copy of the original license.