UXSenti: Unsupervised Cross-lingual Sentiment Analysis

This repository contains the source code for the CoNLL 2019 paper Towards a Unified End-to-End Approach for Fully Unsupervised Cross-Lingual Sentiment Analysis. We present models that perform cross-lingual sentiment analysis (CLSA) in a fully unsupervised setting, i.e. without target-language annotations or cross-lingual supervision. The models rely on language modeling to transfer sentiment resources from the source language to the target language. We address two CLSA settings:

  • Cross-lingual in-domain (e.g. English book reviews for training and Japanese book reviews for testing)

  • Cross-lingual cross-domain (e.g. English book reviews for training and Japanese music reviews for testing)

Supported languages: English (en) / French (fr) / German (de) / Japanese (ja)

Supported domains: books / dvd / music

Citation

@inproceedings{feng-wan-2019-towards,
    title = "Towards a Unified End-to-End Approach for Fully Unsupervised Cross-Lingual Sentiment Analysis",
    author = "Feng, Yanlin  and
      Wan, Xiaojun",
    booktitle = "Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/K19-1097",
}

Dependencies

  • Python >= 3.6
  • PyTorch 1.1
  • NLTK (for English / French / German tokenization)
  • MeCab (for Japanese tokenization; see the sketch below)
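
For reference, these tokenizers are presumably wired up along the following lines inside preprocess.py (a minimal sketch: the tokenize helper and its dispatch table are illustrative assumptions, not the repository's actual API):

import nltk
from nltk.tokenize import word_tokenize  # English / French / German
import MeCab                             # Japanese

nltk.download('punkt')                   # Punkt models required by word_tokenize
mecab = MeCab.Tagger('-Owakati')         # -Owakati makes MeCab emit space-separated tokens

def tokenize(text, lang):
    # Dispatch on the two-letter language codes used throughout this repo.
    if lang == 'ja':
        return mecab.parse(text).split()
    names = {'en': 'english', 'fr': 'french', 'de': 'german'}
    return word_tokenize(text, language=names[lang])

print(tokenize('This book was great!', 'en'))  # ['This', 'book', 'was', 'great', '!']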

Usage

Download / Preprocess Data

Run the following commands to clone the repository and download the Amazon review dataset (Webis-CLS-10):

git clone https://github.com/Evan-Feng/UXSenti.git
cd UXSenti
wget -P data/ http://www.uni-weimar.de/medien/webis/corpora/corpus-webis-cls-10/cls-acl10-unprocessed.tar.gz
tar xvf data/cls-acl10-unprocessed.tar.gz -C data/

Then run the following script to preprocess the data:

python preprocess.py

This script will:

  • Tokenize the data
  • Generate vocabularies for all the languages (roughly as sketched after this list)
  • Binarize the training / test data
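
The vocabulary step presumably amounts to frequency counting over the tokenized files (a rough illustration; the file path, frequency cutoff, and <unk> token below are guesses rather than the script's actual settings):

from collections import Counter

counts = Counter()
with open('data/tokenized/en.books.unlabeled', encoding='utf-8') as f:
    for line in f:
        counts.update(line.split())

# Keep the most frequent tokens; '<unk>' stands in for everything else
vocab = ['<unk>'] + [w for w, _ in counts.most_common(50000)]
print(len(vocab), vocab[:10])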

The resulting file structure will look like:

.
├── README.md
└── data
    ├── train.pth                (binarized training data)
    ├── test.pth                 (binarized test data)
    ├── tokenized
    │   ├── en.books.unlabeled   (tokenized unlabeled data from the EN-BOOKS domain)
    │   ├── en.books.train       (tokenized training data)
    │   ├── en.books.test        (tokenized test data)
    │   └── ...
    └── vocab
        ├── en.vocab             (English vocabulary)
        ├── fr.vocab             (French vocabulary)
        ├── de.vocab             (German vocabulary)
        └── ja.vocab             (Japanese vocabulary)
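
The .pth files are presumably written with torch.save, so they can be loaded back for inspection (a speculative sketch: the schema of train.pth is whatever preprocess.py saves, and the one-token-per-line vocabulary format is an assumption):

import torch

train = torch.load('data/train.pth')  # binarized training data
print(type(train))                    # inspect the top-level container first

# Vocabulary files: assumed plain text, one token per line
with open('data/vocab/en.vocab', encoding='utf-8') as f:
    en_vocab = [line.rstrip('\n') for line in f]
print(len(en_vocab), en_vocab[:10])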

Run Cross-lingual In-Domain SA

Run the following command to train a CLIDSA_{full} model:

python cross_lingual_in_domain.py --export export/clid/

By default, English is used as the source language and the other languages as targets (note that using multiple target languages only affects the evaluation). To use a different source language, simply run:

python cross_lingual_in_domain.py --src fr --trg en de ja --export export/clid2/

To run the experiment in a different domain (set via --sup_dom):

python cross_lingual_in_domain.py --src fr --trg en de ja --sup_dom music --export export/clid3/

To run CLIDSA_{min}:

python cross_lingual_in_domain.py --lang en fr --src en --trg fr --dom books --sup_dom books --export export/clid4/

To run all the CLIDSA experiments (including CLIDSA_{full}, CLIDSA_{min} and MWE):

bash run_clid.sh

Run Cross-lingual Cross-Domain SA

Run the following command to train a CLCDSA model:

python cross_lingual_cross_domain.py --export export/clcd/

By default, the transfer direction is from EN-DVD to DE-BOOKS. To specify a different direction:

python cross_lingual_cross_domain.py --src fr-music --trg de-books --export export/clcd2/

To run all the CLCDSA experiments (including CLCDSA and MWE):

bash run_clcd.sh