/biadapt

Primary language: Jupyter Notebook. License: Apache License 2.0.

Methods for Domain Adaptation of Bilingual Tasks

This repository contains the implementation of the paper Two Methods for Domain Adaptation of Bilingual Tasks: Delightfully Simple and Broadly Applicable. We use off-the-shelf systems for the downstream tasks; our modified versions of their code are also included in this repository.

Cite

@InProceedings{P18-1075,
  author = 	"Hangya, Viktor
  		and Braune, Fabienne
		and Fraser, Alexander
		and Sch{\"u}tze, Hinrich",
  title = 	"Two Methods for Domain Adaptation of Bilingual Tasks: Delightfully Simple and Broadly Applicable",
  booktitle = 	"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"810--820",
  location = 	"Melbourne, Australia",
  url = 	"http://aclweb.org/anthology/P18-1075"
}

Requirements

  • Python 3.5
  • Dependencies are listed in requirements.txt; install them with:
	pip install -r requirements.txt

Cross Lingual Sentiment Classification

Target-ignorant system

  • Follow the procedure in the Semi-supervised section below
  • Set visit and walker weights to 0.0
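
Why zeroing both weights yields a target-ignorant system can be illustrated with a minimal sketch. It assumes the semi-supervised objective is a weighted sum of a supervised classification loss and two association losses (walker and visit) computed on unlabeled target-domain data; this README does not state the objective explicitly, and the function and parameter names below are hypothetical, not taken from the repository.

```python
def total_loss(supervised_loss, walker_loss, visit_loss,
               walker_weight=1.0, visit_weight=1.0):
    """Combine the losses as a weighted sum (assumed form of the objective).

    supervised_loss: loss on labeled source-domain examples.
    walker_loss, visit_loss: association losses that involve unlabeled
    target-domain examples (hypothetical names for illustration).
    """
    return (supervised_loss
            + walker_weight * walker_loss
            + visit_weight * visit_loss)


# With both weights at 0.0 the unlabeled data no longer contributes,
# so training is driven by the supervised loss alone.
loss_target_ignorant = total_loss(0.7, 0.3, 0.2,
                                  walker_weight=0.0, visit_weight=0.0)
```

Under this assumption, the same training procedure serves both systems, and only the two weights differ.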

Target-aware system

  • As the target-aware system, we used the method of Zhang et al. (2016)
  • To convert the data to IOB format, use the script: scripts/to_iob.py

Semi-supervised

Data

  • Domain-specific data: we provide the tweet IDs for the 22M_tweets dataset. To download them, run:
wget http://www.cis.uni-muenchen.de/~hangyav/data/22M_tweet_ids.tar.bz2 -O - | tar -xj
  • General domain data: OpenSubtitles parallel corpus
  • Bilingual lexicon: BNC included in this repository
  • Sentiment data: RepLab

Bilingual Lexicon Induction

Cosine similarity

  • scripts/bll_with_threshold.py: also used for fine-tuning the threshold on the development set (use -h to list the input parameters)
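
The idea of cosine-based lexicon induction with a tuned threshold can be sketched as follows. This is a toy illustration, not the logic of scripts/bll_with_threshold.py: it assumes source and target word vectors already live in a shared bilingual space, picks the nearest target word for each source word, and keeps the pair only if its cosine similarity reaches the threshold. The function names, the F1 tuning criterion, and the toy vectors are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def induce_lexicon(src_vecs, tgt_vecs, threshold):
    """Map each source word to its nearest target word if similar enough."""
    lexicon = {}
    for src_word, src_vec in src_vecs.items():
        best_word, best_sim = None, -1.0
        for tgt_word, tgt_vec in tgt_vecs.items():
            sim = cosine(src_vec, tgt_vec)
            if sim > best_sim:
                best_word, best_sim = tgt_word, sim
        # Pairs below the threshold are dropped rather than guessed.
        if best_word is not None and best_sim >= threshold:
            lexicon[src_word] = best_word
    return lexicon


def f1(predicted, gold):
    """F1 of predicted pairs against gold {source: set of translations}."""
    correct = sum(1 for s, t in predicted.items() if t in gold.get(s, ()))
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(src_vecs, tgt_vecs, dev_gold, candidates):
    """Pick the candidate threshold with the best F1 on the development set."""
    return max(candidates,
               key=lambda th: f1(induce_lexicon(src_vecs, tgt_vecs, th), dev_gold))


# Toy data: "xyz" has no good translation and should be filtered out.
src = {"hund": [1.0, 0.0], "katze": [0.0, 1.0], "xyz": [0.7, 0.7]}
tgt = {"dog": [0.9, 0.1], "cat": [0.1, 0.9]}
dev = {"hund": {"dog"}, "katze": {"cat"}}

best = tune_threshold(src, tgt, dev, [0.5, 0.9, 0.999])
lexicon = induce_lexicon(src, tgt, best)
```

A low threshold admits the spurious pair for "xyz" and hurts precision, while a very high one rejects every pair; tuning on the development set balances the two.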

Classification

As the classifier for BLI, we used the method introduced by Heyman et al. (2017).

Requirements

  • A different environment is needed due to the use of Python 2.7 in the original code of the classifier
  • An easy way to deal with different environments is Conda
  • Dependencies are listed in BLI_classifier/requirements.txt; install them with:
	pip install -r BLI_classifier/requirements.txt

Classifier

  • To download the data, embeddings, and lexicon released by Heyman et al. (2017), run: scripts/get_eacl_data.sh
  • An example script demonstrating the use of the system: scripts/run_BLI_classifier.sh

Semi-supervised

  • An example script demonstrating the use of the system: scripts/run_BLI_classifier_semisup.sh

Data

  • Domain-specific data and train/dev/test lexicons: Link, or download them by running: scripts/get_eacl_data.sh
  • General domain data: Europarl (v7)
  • Bilingual lexicon: BNC included in this repository