Data Augmentation for Natural Language Processing

This code is an implementation of data augmentation methods for natural language processing tasks. Data augmentation expands the training data by generating pseudo-sentences from supervised (labeled) sentences; for example, replacing a token in "the movie was great" might yield the pseudo-sentence "the film was great".

Installation

This code depends on the following:

  • python>=3.6.5

Clone this repository and install the requirements:

git clone https://github.com/tkmaroon/data-augmentation-for-nlp.git
cd data-augmentation-for-nlp
pip install -r requirements.txt

Data Augmentation

You can choose a data augmentation strategy by combining a sampling strategy (--sampling-strategy) with a generation strategy (--augmentation-strategy).

Sampling strategies (--sampling-strategy)

This option decides how token positions are sampled from the original sentence pairs.

strategy | description
--- | ---
random | Randomly sample token positions (see the sketch below).
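
As a rough illustration of the random strategy, here is a minimal sketch (not the repository's code; sample_positions and ratio are hypothetical names):

import random

def sample_positions(tokens, ratio=0.15, seed=None):
    # Randomly choose which token positions the generation
    # strategy will rewrite; `ratio` controls how many.
    rng = random.Random(seed)
    n = max(1, int(len(tokens) * ratio))
    return sorted(rng.sample(range(len(tokens)), n))

tokens = "the movie was surprisingly great".split()
print(sample_positions(tokens, ratio=0.4, seed=0))  # e.g. [1, 3]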

Generation strategies (--augmentation-strategy)

strategy | description
--- | ---
dropout | Drop a token [1, 2].
blank | Replace a token with a placeholder token [3].
unigram | Replace a token with a sample from the unigram frequency distribution over the vocabulary [3]. Please set the option --unigram-frequency-for-generation.
bigramkn | Replace a token using bigram Kneser-Ney smoothing [3]. Please set the option --bigram-frequency-for-generation.
wordnet | Replace a token with one of its WordNet synonyms. Please set the option --lang-for-wordnet.
ppdb | Replace a token with a paraphrase from a given paraphrase database. Please set the option --ppdb-file.
word2vec | Replace a token with a token whose word2vec vector is similar. Please set the option --w2v-file.
bert | Replace a token by sampling from the output probabilities of BERT masked-token prediction. Please set the option --model-name-or-path; it must be a shortcut name from Hugging Face's pytorch-transformers. Note that the option --vocab-file must be the same as the vocabulary file of the BERT tokenizer.
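
To make the sampling/generation split concrete, here is a minimal sketch of the unigram strategy from [3] applied at sampled positions. This is not the repository's implementation; the toy frequency dict stands in for the file passed via --unigram-frequency-for-generation, and all names are hypothetical:

import random

def unigram_generate(tokens, positions, unigram_freq, seed=None):
    # Replace each sampled position with a token drawn from the
    # unigram frequency distribution over the vocabulary [3].
    rng = random.Random(seed)
    vocab = list(unigram_freq)
    weights = [unigram_freq[w] for w in vocab]
    out = list(tokens)
    for i in positions:
        out[i] = rng.choices(vocab, weights=weights, k=1)[0]
    return out

freq = {"the": 5, "movie": 2, "film": 2, "great": 1, "good": 1}  # toy counts
print(unigram_generate("the movie was great".split(), [1], freq, seed=0))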

Usage

python generate.py \
    --input ./data/sample/sample.txt \
    --augmentation-strategy bert \
    --model-name-or-path bert-base-multilingual-uncased \
    --temparature 1.0
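
Under the hood, the bert strategy amounts to masking a sampled position and sampling its replacement from BERT's output distribution, softened by the temperature. Below is a minimal sketch using the transformers package (the successor of pytorch-transformers, which this repository targets); bert_replace is a hypothetical name, not the repository's API:

import torch
from transformers import BertTokenizer, BertForMaskedLM

name = "bert-base-multilingual-uncased"
tokenizer = BertTokenizer.from_pretrained(name)
model = BertForMaskedLM.from_pretrained(name)
model.eval()

def bert_replace(sentence, position, temperature=1.0):
    # Mask one token and sample its replacement from BERT's
    # predicted distribution over the vocabulary.
    tokens = tokenizer.tokenize(sentence)
    tokens[position] = tokenizer.mask_token
    ids = tokenizer.convert_tokens_to_ids(
        [tokenizer.cls_token] + tokens + [tokenizer.sep_token])
    with torch.no_grad():
        logits = model(torch.tensor([ids])).logits  # (1, seq_len, vocab)
    # position + 1 skips the prepended [CLS] token.
    probs = torch.softmax(logits[0, position + 1] / temperature, dim=-1)
    new_id = torch.multinomial(probs, 1).item()
    tokens[position] = tokenizer.convert_ids_to_tokens(new_id)
    return " ".join(tokens)

print(bert_replace("the movie was great", 1))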

References

[1] Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of ACL 2015, volume 1, pages 1681–1691.

[2] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.

[3] Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. 2017. Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573.