This repository implements a GAN with an auxiliary distance loss to post-specialize word embeddings. Code for the paper:
Edoardo Maria Ponti, Ivan Vulić, Goran Glavaš, Nikola Mrkšić, and Anna Korhonen. 2018. Adversarial Propagation and Zero-Shot Cross-Lingual Transfer of Word Vector Specialization. In Proceedings of EMNLP 2018. [arXiv]
If you use this software for academic research, please cite the paper in question:
@inproceedings{ponti2018adversarial,
title={Adversarial Propagation and Zero-Shot Cross-Lingual Transfer of Word Vector Specialization},
author={Ponti, Edoardo Maria and Vulić, Ivan and Glavaš, Goran and Mrkšić, Nikola and Korhonen, Anna},
booktitle={Proceedings of EMNLP 2018},
year={2018}
}
The repository is organized as follows:

- code: contains the main scripts to train and test AuxGAN for post-specialization
- evaluation: script and databases for intrinsic evaluation
- results: where the output of the post-specialization procedure is saved
- vectors: contains the training data (i.e. original vectors and vectors specialized by Attract-Repel)
- vocab: list of words to be excluded from the training data because they are present in the evaluation databases.
All vectors and models for English are available from Google Drive at the following link.
We provide the vectors for Skip-Gram with Negative Sampling (sgns), GloVe (glove), and fastText (ft). Separate files contain: 1) the original distributional vectors of the entire vocabulary (prefix); 2) the original distributional vectors of the words seen in the constraints (distrib); 3) the vectors specialized by Attract-Repel (ar); and 4) the post-specialized vectors (postspec).
The folder also contains some pre-trained models. These can be applied to new original distributional embeddings (e.g. from other languages), provided that they have been previously aligned with our original distributional spaces. In our experiments, we performed unsupervised alignments with MUSE.
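As an illustration, applying such an alignment amounts to a single matrix multiplication. The sketch below is a minimal example, not part of this repository, assuming the MUSE mapping has been exported as a NumPy matrix; both file names are hypothetical.

```python
# Minimal sketch: apply a learned linear alignment to new embeddings before
# feeding them to a pre-trained post-specialization model. File names are
# hypothetical; MUSE itself stores the mapping as a torch checkpoint.
import numpy as np

W = np.load("muse_mapping.npy")                # (dim, dim) source-to-target map
src_vectors = np.load("new_lang_vectors.npy")  # (vocab_size, dim) new space
aligned = src_vectors @ W.T                    # rows now live in the shared space
```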
Finally, the subfolder xling contains post-specialized fastText embeddings for Italian and German.
To train AuxGAN, run:

```
cd code
python adversarial.py --seen_file ../vectors/SEEN_VECTORS --adjusted_file ../vectors/AR_SPECIALIZED_VECTORS \
    --unseen_file ../vectors/ALL_ORIGINAL_VECTORS --out_dir ../results/EXPERIMENT_NAME
```
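For intuition, the generator is trained to fool a discriminator that separates mapped vectors from genuine Attract-Repel vectors, while the auxiliary distance loss keeps mapped vectors close to their specialized targets. Below is a minimal PyTorch-style sketch, not the repository's actual code; the L2 distance, layer sizes, and loss weight are assumptions.

```python
# Sketch of the AuxGAN generator objective: adversarial term + auxiliary
# distance term. Layer sizes, the L2 distance, and aux_weight are assumptions.
import torch
import torch.nn as nn

gen = nn.Sequential(nn.Linear(300, 512), nn.ReLU(), nn.Linear(512, 300))
disc = nn.Sequential(nn.Linear(300, 512), nn.ReLU(), nn.Linear(512, 1))
bce = nn.BCEWithLogitsLoss()
aux_weight = 1.0  # hypothetical weight balancing the two terms

def generator_loss(x_distrib, x_specialized):
    """x_distrib: original vectors; x_specialized: their AR counterparts."""
    fake = gen(x_distrib)
    # Adversarial term: make the discriminator label mapped vectors as real.
    adv = bce(disc(fake), torch.ones(fake.size(0), 1))
    # Auxiliary distance term: stay close to the specialized target vectors.
    aux = ((fake - x_specialized) ** 2).sum(dim=1).mean()
    return adv + aux_weight * aux
```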
After completing the epochs, the script saves two files in the folder specified as --out_dir: gold_embs.txt and silver_embs.txt. They correspond to two different settings: in the gold setting, AR-specialized vectors are kept whenever available and post-specialized vectors fill in the rest; in the silver setting, only post-specialized vectors are saved. The paper reports the gold setting.
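The difference between the two settings can be sketched as follows, assuming the vectors are held in plain {word: vector} dictionaries (the script's actual implementation may differ):

```python
# Sketch of the gold vs. silver output settings; not the repository's code.
def build_output(ar_vectors, postspec_vectors, setting="gold"):
    if setting == "silver":
        # Silver: post-specialized vectors only.
        return dict(postspec_vectors)
    # Gold: prefer the AR-specialized vector whenever one exists.
    return {w: ar_vectors.get(w, v) for w, v in postspec_vectors.items()}
```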
To apply a pre-trained model to new (aligned) distributional vectors, run:

```
cd code
python export.py --in_file ../vectors/IN_VECTORS --out_file ../vectors/OUT_VECTORS \
    --params ../models/EXPERIMENT_SETTINGS.pkl --model ../models/MAPPING_PARAMETERS.t7
```
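The exported file can then be read like any of the other vector files. Here is a minimal loader, assuming the usual whitespace-separated text format (word v1 ... vd per line):

```python
# Minimal loader for whitespace-separated text embeddings (format assumed).
import numpy as np

def load_vectors(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors
```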
To evaluate with SimLex-999 (or SimVerb-3500), call the evaluation script in the evaluation/ directory:

```
python simlex_evaluator.py simlexorig999.txt ../out_dir/<output_file>
```
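The intrinsic evaluation boils down to a Spearman correlation between cosine similarities and the gold ratings. Below is a minimal sketch (parsing of the dataset file is omitted, and simlex_evaluator.py remains the authoritative version):

```python
# Sketch of SimLex-style evaluation: Spearman correlation between cosine
# similarities and gold similarity ratings. Not the repository's code.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate(pairs, vectors):
    """pairs: list of (word1, word2, gold_score); vectors: {word: np.array}."""
    gold, pred = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            gold.append(score)
            pred.append(cosine(vectors[w1], vectors[w2]))
    return spearmanr(gold, pred).correlation
```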
Our paper reports state-of-the-art scores for both SimLex and SimVerb. In the disjoint setting, the words appearing in these datasets were discarded from the Attract-Repel constraints; in the full setting, all constraints were used.
**Disjoint setting**

| Vectors | glove-cc SL | glove-cc SV | fasttext SL | fasttext SV | sgns-w2 SL | sgns-w2 SV |
|---|---|---|---|---|---|---|
| Distributional | .407 | .280 | .383 | .247 | .414 | .272 |
| Specialized: Attract-Repel | .407 | .280 | .383 | .247 | .414 | .272 |
| Post-Specialized: MLP | .645 | .531 | .503 | .340 | .553 | .430 |
| Post-Specialized: AuxGAN | .652 | .552 | .513 | .394 | .581 | .434 |

**Full setting**

| Vectors | glove-cc SL | glove-cc SV | fasttext SL | fasttext SV | sgns-w2 SL | sgns-w2 SV |
|---|---|---|---|---|---|---|
| Distributional | .407 | .280 | .383 | .247 | .414 | .272 |
| Specialized: Attract-Repel | .781 | .761 | .764 | .744 | .778 | .761 |
| Post-Specialized: MLP | .785 | .764 | .768 | .745 | .781 | .763 |
| Post-Specialized: AuxGAN | .789 | .764 | .766 | .741 | .782 | .762 |
Part of the code is borrowed, with some changes, from the GAN implementation in MUSE; the linked repository contains a copy of the original license.