Code for our paper "Improving Vector Space Word Representations Via Kernel Canonical Correlation Analysis" in TALLIP [pdf]
This software runs on Python 3.6 with the following libraries:
- numpy 1.16.2
- scikit-learn 0.20.2
- Preparing monolingual word embeddings and dictionaries.
$word2vec/word2vec -train $corpus_en -window 5 -iter 10 -size 200 -threads 16 -output embeddings_size200.en
$word2vec/word2vec -train $corpus_zh -window 5 -iter 10 -size 200 -threads 16 -output embeddings_size200.zh
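The commands above do not pass `-binary`, so word2vec should write its default text format: a header line with the vocabulary size and dimension, then one word and its vector per line. As a minimal sketch for inspecting such a file with numpy, the helper below reads that format. The name `load_word2vec_text` is hypothetical and is not part of this repository, and whether `train.py` reads embeddings the same way is an assumption.

```python
import io

import numpy as np


def load_word2vec_text(handle):
    """Read word2vec text-format embeddings from an open text file.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Header line: "<vocab_size> <dimension>"
    vocab_size, dim = map(int, handle.readline().split())
    words = []
    vectors = np.zeros((vocab_size, dim), dtype=np.float32)
    for i in range(vocab_size):
        # Each row: the word followed by `dim` space-separated floats
        parts = handle.readline().rstrip().split(" ")
        words.append(parts[0])
        vectors[i] = [float(x) for x in parts[1:dim + 1]]
    return words, vectors


# Tiny in-memory stand-in for a file such as embeddings_size200.en
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
words, vectors = load_word2vec_text(sample)
print(words, vectors.shape)
```

For a real file, pass an open handle instead, for example `open("embeddings_size200.en", encoding="utf-8")`.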
- Generating bilingual word embeddings with our method (BiKCCA).
python train.py -slang $src_lang -tlang $tgt_lang -semb $src_path -temb $tgt_path -d $dict_path -reg 0.3 -g1 0.001 -g2 0.001
`reg`, `g1`, and `g2` are hyperparameters of KCCA and can be tuned on a validation dataset.
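One way to organize such tuning is a small grid search that runs `train.py` once per hyperparameter combination. The sketch below only builds the command lines, using the flags shown above. The helper name `build_train_commands`, the candidate values, and the dictionary path are all hypothetical, and this repository does not prescribe a tuning procedure.

```python
import itertools


def build_train_commands(src_lang, tgt_lang, src_path, tgt_path, dict_path,
                         regs, g1s, g2s):
    """Return one train.py argument list per (reg, g1, g2) combination.

    Hypothetical helper for illustration; not part of this repository.
    """
    commands = []
    for reg, g1, g2 in itertools.product(regs, g1s, g2s):
        commands.append([
            "python", "train.py",
            "-slang", src_lang, "-tlang", tgt_lang,
            "-semb", src_path, "-temb", tgt_path,
            "-d", dict_path,
            "-reg", str(reg), "-g1", str(g1), "-g2", str(g2),
        ])
    return commands


commands = build_train_commands(
    "en", "zh",
    "embeddings_size200.en", "embeddings_size200.zh",
    "dict.en-zh.txt",        # hypothetical dictionary path
    regs=[0.1, 0.3],         # hypothetical candidate values
    g1s=[0.001, 0.01],
    g2s=[0.001, 0.01],
)
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)`, with every run then scored on the validation data. Because the output directory appears to be named after the language pair only, later runs may overwrite earlier results, so copying each result aside before the next run is a reasonable precaution.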
- The resulting bilingual word embeddings are stored in the directory
output/src_lang-tgt_lang/
- To evaluate the bilingual word embeddings, please refer to the code of this work
Please cite "Improving Vector Space Word Representations Via Kernel Canonical Correlation Analysis" if you find the resources in this repository useful.
@article{BaiCZ-18-tallip,
author = {Bai, Xuefeng and Cao, Hailong and Zhao, Tiejun},
title = {Improving Vector Space Word Representations Via Kernel Canonical Correlation Analysis},
journal = {ACM Transactions on Asian and Low-Resource Language Information Processing},
issue_date = {August 2018},
publisher = {ACM},
address = {New York, NY, USA}
}