This repository contains the implementation of the paper "Two Methods for Domain Adaptation of Bilingual Tasks: Delightfully Simple and Broadly Applicable". We use off-the-shelf systems for the downstream tasks; our modified versions of their code are also included in the repository.
@InProceedings{P18-1075,
author = "Hangya, Viktor
and Braune, Fabienne
and Fraser, Alexander
and Sch{\"u}tze, Hinrich",
title = "Two Methods for Domain Adaptation of Bilingual Tasks: Delightfully Simple and Broadly Applicable",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "810--820",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1075"
}
- Python 3.5
- Dependencies are listed in requirements.txt; install them with:
pip install -r requirements.txt
- Follow the procedure in the Semi-supervised section below
- Set visit and walker weights to 0.0
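To see why zeroing these two weights gives a purely supervised system, it helps to look at how a semi-supervised objective in the style of Haeusser et al. (2017) combines its terms. The snippet below is a minimal sketch and not code from this repository: the function `total_loss`, its parameter names, and the default weights of 1.0 are assumptions made for illustration.

```python
def total_loss(classification_loss, walker_loss, visit_loss,
               walker_weight=1.0, visit_weight=1.0):
    """Weighted sum of the supervised and association-based loss terms.

    Hypothetical helper for illustration; the default weights are placeholders.
    """
    return (classification_loss
            + walker_weight * walker_loss
            + visit_weight * visit_loss)


# Toy loss values, chosen only to show the effect of the weights.
semi_supervised = total_loss(0.7, 0.4, 0.2)
supervised = total_loss(0.7, 0.4, 0.2, walker_weight=0.0, visit_weight=0.0)

# With both weights at 0.0 only the classification loss remains.
print(semi_supervised, supervised)
```

With both weights at 0.0 the unlabeled data no longer contributes to training, so the same code path can serve as the supervised baseline.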
- As the target-aware system, we used the method of Zhang et al. (2016)
- To convert data to IOB format, use the script: scripts/to_iob.py
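For readers unfamiliar with IOB tagging, the following minimal sketch shows what the target format looks like. It is not the implementation of scripts/to_iob.py: the function `spans_to_iob`, the span-based input representation, and the label name `positive` are assumptions for illustration, so check the script itself for the input it actually expects.

```python
def spans_to_iob(tokens, spans):
    """Return one IOB tag per token.

    spans is a list of (start, end, label) token ranges with `end` exclusive.
    Hypothetical helper; the real script may use a different input format.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the span
    return tags


tokens = ["I", "love", "New", "York", "!"]
spans = [(2, 4, "positive")]                # "New York" is the tagged target
iob_tags = spans_to_iob(tokens, spans)

# Print one token and its tag per line, separated by a tab.
for token, tag in zip(tokens, iob_tags):
    print(token + "\t" + tag)
```

Tokens outside any span receive `O`, the first token of a span receives a `B-` tag, and the following tokens of the same span receive `I-` tags.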
- For the semi-supervised sentiment system, we modified the original implementation of Haeusser et al. (2017)
- We added an implementation of the CNN-non-static model of Kim (2014)
- An example script demonstrating the use of the system: scripts/run_semisup_sentiment.sh
- Domain-specific data: we provide the tweet IDs for the 22M_tweets dataset. To download them, run:
wget http://www.cis.uni-muenchen.de/~hangyav/data/22M_tweet_ids.tar.bz2 -O - | tar -xj
- General domain data: OpenSubtitles parallel corpus
- Bilingual lexicon: BNC included in this repository
- Sentiment data: RepLab
- scripts/bll_with_threshold.py: can also be used to fine-tune the threshold on the development set (use -h to list the input parameters)
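The idea of tuning a threshold on the development set can be sketched as follows. This is an illustration under stated assumptions, not the logic of scripts/bll_with_threshold.py: the functions `f1_at_threshold` and `tune_threshold`, the use of F1 as the selection criterion, and the toy word pairs and scores are all hypothetical.

```python
def f1_at_threshold(scored_pairs, gold, threshold):
    """F1 of the word pairs whose score reaches the threshold (illustrative)."""
    predicted = {pair for pair, score in scored_pairs if score >= threshold}
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(scored_pairs, gold, candidates):
    """Return the candidate threshold with the highest F1 on the given data."""
    return max(candidates, key=lambda t: f1_at_threshold(scored_pairs, gold, t))


# Toy development data: (source word, target word) pairs with similarity scores.
dev_scores = [
    (("house", "huis"), 0.9),
    (("house", "muis"), 0.4),
    (("dog", "hond"), 0.7),
    (("dog", "kat"), 0.3),
]
dev_gold = {("house", "huis"), ("dog", "hond")}

best_threshold = tune_threshold(dev_scores, dev_gold, [0.1, 0.5, 0.8])
print(best_threshold)
```

The selected threshold would then be kept fixed when translating the test set, so that the test lexicon plays no part in choosing it.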
As the BLI classifier, we used the method introduced by Heyman et al. (2017).
- A separate environment is needed because the original classifier code uses Python 2.7
- Conda is an easy way to manage multiple environments
- Dependencies are listed in BLI_classifier/requirements.txt; install them with:
pip install -r BLI_classifier/requirements.txt
- To download the data, embeddings, and lexicon released by Heyman et al. (2017), run: scripts/get_eacl_data.sh
- An example script demonstrating the use of the system: scripts/run_BLI_classifier.sh
- An example script demonstrating the use of the system: scripts/run_BLI_classifier_semisup.sh
- Domain-specific data and train/dev/test lexicons: Link, or run: scripts/get_eacl_data.sh
- General domain data: Europarl (v7)
- Bilingual lexicon: BNC included in this repository