IMT Style Transfer

Introduction

This repository contains a training suite for unsupervised text style transfer based on the iterative matching and translation (IMT) method proposed by Jin, Zhijing, et al. (see References). The training suite includes custom code to bootstrap and refine a pseudo-parallel dataset. All other heavy lifting is done by OpenNMT-py.

Installation

Step 1: Clone the repository

git clone --recursive https://github.com/BPYap/IMT-Style-Transfer
cd IMT-Style-Transfer

Step 2: Install dependencies

python3 -m virtualenv env
source env/bin/activate

pip install -r requirements.txt
python setup.py install

Step 3: Download pretrained models for sentence encoder

fastText

Download fastText English vectors [direct link]
Decompress the archive and put cc.en.300.bin under the model/pretrained/fastText directory

GloVe

Download spaCy pretrained GloVe model [direct link]
Decompress the archive and put en_vectors_web_lg-2.1.0 (the innermost folder) under the model/pretrained/spacy_glove directory

Universal Sentence Encoder

Download the transformer variant of Universal Sentence Encoder [direct link]
Decompress the archive and put assets, variables, saved_model.pb and tfhub_module.pb under the model/pretrained/universal_sentence_encoder directory
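
Before training, it can help to confirm that the downloads ended up where the steps above say they should. The following is a minimal sketch, not part of the repository: the function name missing_pretrained_files and the EXPECTED table are hypothetical, and the paths are copied from the instructions above, relative to the repository root.

```python
from pathlib import Path

# Paths copied from the download steps above, relative to the repository root.
# The dictionary keys reuse the sentence_encoder names from the config file.
EXPECTED = {
    "fasttext": ["model/pretrained/fastText/cc.en.300.bin"],
    "glove": ["model/pretrained/spacy_glove/en_vectors_web_lg-2.1.0"],
    "use": [
        "model/pretrained/universal_sentence_encoder/assets",
        "model/pretrained/universal_sentence_encoder/variables",
        "model/pretrained/universal_sentence_encoder/saved_model.pb",
        "model/pretrained/universal_sentence_encoder/tfhub_module.pb",
    ],
}


def missing_pretrained_files(repo_root="."):
    """Return {encoder: [missing paths]} for every encoder with missing files."""
    root = Path(repo_root)
    missing = {}
    for encoder, paths in EXPECTED.items():
        absent = [p for p in paths if not (root / p).exists()]
        if absent:
            missing[encoder] = absent
    return missing


if __name__ == "__main__":
    report = missing_pretrained_files()
    for encoder, paths in report.items():
        print(f"{encoder}: missing {', '.join(paths)}")
    if not report:
        print("All pretrained files found.")
```

Only the encoder selected in the configuration file needs to be present, so a missing entry for an unused encoder is not necessarily a problem.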

Usage

python script/imt_train.py CONFIG

argument:
  CONFIG path to config file (e.g.: config/experiments/sample.yml)

Configuration file

The script reads all configuration settings from a single YAML file. To get started, copy the provided sample.yml file in the config/experiments/ folder and modify the value of each parameter as needed. Each parameter (other than the general ones) is prefixed with the name of the pipeline component that uses it. For example, bootstrap_corpus-sentence_encoder is the sentence_encoder parameter used by the bootstrap_corpus component.
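
To illustrate the naming convention, here is a minimal sketch that splits prefixed parameter names into per-component groups. It is not the repository's own loader: the function group_by_component is hypothetical, the values are made up, and a plain dictionary stands in for the parsed YAML file.

```python
def group_by_component(flat_config, general_keys):
    """Split 'component-parameter' keys into {component: {parameter: value}}."""
    grouped = {"general": {}}
    for key, value in flat_config.items():
        if key in general_keys:
            # General parameters carry no component prefix.
            grouped["general"][key] = value
            continue
        component, separator, parameter = key.partition("-")
        if not separator or not parameter:
            raise ValueError(f"expected 'component-parameter', got {key!r}")
        grouped.setdefault(component, {})[parameter] = value
    return grouped


# Hypothetical values, standing in for a parsed config file.
flat = {
    "src_corpus": "data/src.txt",
    "tgt_corpus": "data/tgt.txt",
    "min_update_rate": 0.05,
    "bootstrap_corpus-sentence_encoder": "use",
    "bootstrap_corpus-similarity_threshold": 0.7,
    "prepare_dataset-validation_ratio": 0.1,
    "prepare_dataset-test_ratio": 0.1,
}

grouped = group_by_component(flat, {"src_corpus", "tgt_corpus", "min_update_rate"})
print(grouped["bootstrap_corpus"])
```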

There are 6 types of configurable parameters in total:

general

src_corpus: Path to unaligned source corpus.
tgt_corpus: Path to unaligned target corpus.
min_update_rate: Convergence criterion. The iterative process stops when the overall update rate of the newly generated pseudo-parallel corpus falls below this value.
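
The stopping rule can be sketched as follows. This README does not define how the update rate is computed, so the sketch assumes it is the fraction of source sentences whose paired target sentence changed between two iterations. The functions update_rate and has_converged are hypothetical and not taken from the repository.

```python
def update_rate(previous_pairs, new_pairs):
    """Fraction of source sentences whose paired target changed.

    Assumed definition: both arguments map a source sentence to its
    current pseudo-parallel target sentence.
    """
    if not new_pairs:
        return 0.0
    changed = sum(
        1 for src, tgt in new_pairs.items() if previous_pairs.get(src) != tgt
    )
    return changed / len(new_pairs)


def has_converged(previous_pairs, new_pairs, min_update_rate):
    """Stop once the update rate is lower than min_update_rate."""
    return update_rate(previous_pairs, new_pairs) < min_update_rate


# Toy corpora: one of four pairs changed between iterations.
previous = {"a": "x", "b": "y", "c": "z", "d": "w"}
current = {"a": "x", "b": "y2", "c": "z", "d": "w"}
print(update_rate(previous, current))
print(has_converged(previous, current, min_update_rate=0.3))
```

Under this assumed definition, a smaller min_update_rate makes the process run for more iterations before it stops.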

bootstrap_corpus

sentence_encoder: Type of sentence encoder. Choose one of "fasttext" (average fastText embedding), "glove" (average GloVe embedding) or "use" (Universal Sentence Encoder).
similarity_threshold: Threshold for the cosine similarity score used when matching source and target sentences. Source-target pairs whose cosine similarity score is lower than this threshold are discarded.
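
The thresholding step can be sketched in plain Python. The sketch assumes that each source sentence is paired with its single most similar target sentence, which this README does not state explicitly. The toy vectors stand in for real sentence embeddings, and match_sentences is a hypothetical helper rather than the repository's function.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_sentences(src_vectors, tgt_vectors, similarity_threshold):
    """Pair each source index with its most similar target index.

    Pairs whose score is lower than similarity_threshold are discarded.
    Returns a list of (source_index, target_index, score) tuples.
    """
    pairs = []
    if not tgt_vectors:
        return pairs
    for i, src in enumerate(src_vectors):
        scores = [cosine_similarity(src, tgt) for tgt in tgt_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= similarity_threshold:
            pairs.append((i, best, scores[best]))
    return pairs


# Toy 2-dimensional "embeddings" for two source and two target sentences.
src_vectors = [[1.0, 0.0], [0.0, 1.0]]
tgt_vectors = [[1.0, 0.1], [-1.0, 0.0]]
matches = match_sentences(src_vectors, tgt_vectors, similarity_threshold=0.5)
print(matches)
```

Here the first source vector finds a close target and is kept, while the second has no target above the threshold and is dropped. A higher threshold gives a smaller but cleaner bootstrapped corpus.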

prepare_dataset

validation_ratio: Fraction of the dataset to set aside as the validation set.
test_ratio: Fraction of the dataset to set aside as the test set.
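
As a rough illustration of how the two ratios divide a dataset, the sketch below shuffles the pairs and slices off the validation and test portions, leaving the rest for training. Whether the repository shuffles, and how it rounds, is an assumption. split_pairs and its seed argument are hypothetical.

```python
import random


def split_pairs(pairs, validation_ratio, test_ratio, seed=0):
    """Split pairs into (train, validation, test) lists by the given ratios."""
    if validation_ratio < 0 or test_ratio < 0 or validation_ratio + test_ratio >= 1:
        raise ValueError("ratios must be non-negative and sum to less than 1")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_valid = int(len(shuffled) * validation_ratio)
    n_test = int(len(shuffled) * test_ratio)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


# Toy dataset of 100 items standing in for sentence pairs.
train, valid, test = split_pairs(range(100), validation_ratio=0.1, test_ratio=0.2)
print(len(train), len(valid), len(test))
```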

Output

All intermediate data is stored in data/experiments/<experiment_name> while the trained models are stored in model/experiments/<experiment_name>.

To quickly test a trained model interactively, run the following:

python script/imt_demo.py [--model_path MODEL_PATH]
 
argument:
  MODEL_PATH path to trained model (e.g.: model/experiments/sample/0-onmt_model_step_1024.pt)

References

Mikolov, Tomas, et al. "Advances in pre-training distributed word representations." arXiv preprint arXiv:1712.09405 (2017).
Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.
Cer, Daniel, et al. "Universal sentence encoder." arXiv preprint arXiv:1803.11175 (2018).
Jin, Zhijing, et al. "Unsupervised Text Style Transfer via Iterative Matching and Translation." arXiv preprint arXiv:1901.11333 (2019).
Klein, Guillaume, et al. "OpenNMT: Neural Machine Translation Toolkit." arXiv preprint arXiv:1805.11462 (2018).
