This code is an implementation of data augmentation for natural language processing tasks. Data augmentation is a method for expanding training data by generating pseudo-sentences from supervised sentences.
This code depends on the following:

- python>=3.6.5
```
git clone https://github.com/tkmaroon/data-augmentation-for-nlp.git
cd data-augmentation-for-nlp
pip install -r requirements.txt
```
You can choose a data augmentation method by combining a sampling strategy (`--sampling-strategy`) and a generation strategy (`--augmentation-strategy`).

The option `--sampling-strategy` decides how to sample token positions in the original sentence pairs (a short sketch follows the table below).
strategy | description |
---|---|
random | Randomly sample tokens. |
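For intuition, random position sampling can be sketched in Python as below. This is a minimal illustration, not the repository's implementation; the function name `sample_positions` and the `ratio` parameter are hypothetical.

```python
import random

def sample_positions(tokens, ratio=0.15):
    """Randomly choose token positions to rewrite (illustrative sketch)."""
    n = max(1, int(len(tokens) * ratio))
    return sorted(random.sample(range(len(tokens)), n))

tokens = "the quick brown fox jumps over the lazy dog".split()
print(sample_positions(tokens))  # e.g. [4] -- one randomly chosen position
```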
The option `--augmentation-strategy` decides how to generate a replacement at each sampled position (a combined sketch follows the table).

strategy | description |
---|---|
dropout | Drop a token [1, 2]. |
blank | Replace a token with a placeholder token [3]. |
unigram | Replace a token with a sample from the unigram frequency distribution over the vocabulary [3]. Please set the option `--unigram-frequency-for-generation`. |
bigramkn | Replace a token using bigram Kneser-Ney smoothing [3]. Please set the option `--bigram-frequency-for-generation`. |
wordnet | Replace a token with a synonym from WordNet. Please set the option `--lang-for-wordnet`. |
ppdb | Replace a token with a paraphrase from the given paraphrase database. Please set the option `--ppdb-file`. |
word2vec | Replace a token with a token that has a similar word2vec vector. Please set the option `--w2v-file`. |
bert | Replace a token using the output probability of BERT masked token prediction. Please set the option `--model-name-or-path`; it must be a shortcut name in Hugging Face's pytorch-transformers. Note that the option `--vocab-file` must match the vocabulary file of the BERT tokenizer. |
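To show how sampling and generation combine, here is a minimal sketch of the `dropout` and `blank` strategies applied at sampled positions. It is illustrative only and does not mirror the repository's code; the `augment` function and the `<blank>` placeholder string are assumptions.

```python
def augment(tokens, positions, strategy="blank", placeholder="<blank>"):
    """Apply a generation strategy at the sampled positions (illustrative sketch)."""
    out = []
    for i, tok in enumerate(tokens):
        if i in positions and strategy == "dropout":
            continue                 # dropout: remove the token entirely
        elif i in positions and strategy == "blank":
            out.append(placeholder)  # blank: substitute a placeholder token
        else:
            out.append(tok)
    return out

tokens = "the quick brown fox".split()
positions = {1}  # assume position 1 was chosen by the sampling strategy
print(augment(tokens, positions, strategy="dropout"))  # ['the', 'brown', 'fox']
print(augment(tokens, positions, strategy="blank"))    # ['the', '<blank>', 'brown', 'fox']
```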
For example, the following command generates pseudo-sentences with the `bert` strategy:

```
python generate.py \
    --input ./data/sample/sample.txt \
    --augmentation-strategy bert \
    --model-name-or-path bert-base-multilingual-uncased \
    --temparature 1.0
```
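For intuition about the `bert` strategy, the following sketch uses the Hugging Face `transformers` fill-mask pipeline (the successor to pytorch-transformers) to replace one token via masked token prediction. It approximates the idea only; the repository's actual code, including its temperature-based sampling, may differ.

```python
from transformers import pipeline

# Load a masked-language-model pipeline (model name taken from the example above).
fill = pipeline("fill-mask", model="bert-base-multilingual-uncased")

tokens = "the quick brown fox jumps".split()
position = 2  # assume this position was chosen by the sampling strategy

# Mask the chosen token and let BERT predict candidates for the slot.
masked = tokens[:position] + [fill.tokenizer.mask_token] + tokens[position + 1:]
candidates = fill(" ".join(masked))

# Greedily take the top prediction (the real strategy samples with a temperature).
tokens[position] = candidates[0]["token_str"].strip()
print(" ".join(tokens))
```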