DataAug4NLP

Collection of papers and resources for data augmentation for NLP.

Data Augmentation Techniques for NLP

If you'd like to add your paper, do not email us. Instead, read the protocol for adding a new entry and send a pull request.

We group the papers by text classification, translation, summarization, question-answering, sequence tagging, parsing, grammatical-error-correction, generation, dialogue, multimodal, mitigating bias, mitigating class imbalance, and adversarial examples.

This repository is based on our paper, "A Survey of Data Augmentation Approaches for NLP" (Findings of ACL '21). You can cite it as follows:

@article{feng2021survey,
  title={A Survey of Data Augmentation Approaches for NLP},
  author={Feng, Steven Y and Gangal, Varun and Wei, Jason and Chandar, Sarath and Vosoughi, Soroush and Mitamura, Teruko and Hovy, Eduard},
  journal={Findings of ACL},
  year={2021}
}

Authors: Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy

Note: direct inquiries to stevenyfeng@gmail.com, or open an issue here.

Text Classification

Paper Datasets
Synonym Replacement (Character-Level Convolutional Networks for Text Classification, NeurIPS '15) AG’s News, DBPedia, Yelp, Yahoo Answers, Amazon
That’s So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets (EMNLP '15) Twitter
Robust Training under Linguistic Adversity (EACL '17) code Movie review, customer review, SUBJ, SST
Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations (NAACL '18) code SST, SUBJ, MPQA, RT, TREC
Variational Pretraining for Semi-supervised Text Classification (ACL '19) code IMDB, AG News, Yahoo, hatespeech
EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks (EMNLP '19) code SST, CR, SUBJ, TREC, PC
Nonlinear Mixup: Out-Of-Manifold Data Augmentation for Text Classification (AAAI '20) TREC, SST, Subj, MR
MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification (ACL '20) code AG News, DBpedia, Yahoo, IMDb
Unsupervised Data Augmentation for Consistency Training (NeurIPS '20) code Yelp, IMDb, Amazon, DBpedia
Not Enough Data? Deep Learning to the Rescue! (AAAI '20) ATIS, TREC, WVA
SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness (EMNLP '20) code IWSLT'14
Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation (EMNLP '20) ICWSM '20 Data Challenge, SemEval '17 sentiment analysis, SemEval '18 irony
Textual Data Augmentation for Efficient Active Learning on Tiny Datasets (EMNLP '20) SST2, TREC
Text Augmentation in a Multi-Task View (EACL '21) SST2, TREC, SUBJ
Few-Shot Text Classification with Triplet Loss, Data Augmentation, and Curriculum Learning (NAACL '21) code HUFF, COV-Q, AMZN, FEWREL

Natural Language Generation

Paper Datasets
GenAug: Data Augmentation for Finetuning Text Generators (DeeLIO @ EMNLP '20) code TO-DO

Translation

Paper Datasets
Backtranslation (Improving Neural Machine Translation Models with Monolingual Data, ACL '16) WMT '15 en-de, IWSLT '15 en-tr
SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation (EMNLP '18) IWSLT '15 en-vi, IWSLT '16 de-en, WMT '15 en-de
Soft Contextual Data Augmentation for Neural Machine Translation (ACL '19) code IWSLT '14 de/es/he-en, WMT '14 en-de

Question Answering

Paper Datasets
An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering (EMNLP '19 Workshop) MRQA
Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering (arxiv '19) SQuAD, Trivia-QA, CMRC, DRCD
XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering (arxiv '19) XNLI, SQuAD
Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering (arxiv '20) MLQA, XQuAD, SQuAD-it, PIAF
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering (ACL '20) code WIQA, QuaRel, HotpotQA

Summarization

Paper Datasets
Transforming Wikipedia into Augmented Data for Query-Focused Summarization (arxiv '19) DUC
Iterative Data Augmentation with Synthetic Data (Abstract Text Summarization: A Low Resource Challenge, EMNLP '19) SwissText, Common Crawl
Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation (NAACL '21) CNN-DailyMail

Sequence Tagging

Paper Datasets
Data Augmentation via Dependency Tree Morphing for Low-Resource Languages (EMNLP '18) code Universal Dependencies project

Parsing

TODO: https://www.aclweb.org/anthology/2020.emnlp-main.107/

Grammatical Error Correction

Paper Datasets
Using Wikipedia Edits in Low Resource Grammatical Error Correction (WNUT @ EMNLP '18) Falko-MERLIN GEC Corpus
Sequence-to-sequence Pre-training with Data Augmentation for Sentence Rewriting (arxiv '19) CoNLL-2014, JFLEG
SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation (EMNLP '18) IWSLT '15 en-vi, IWSLT '16 de-en, WMT '15 en-de
Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data (BEA @ ACL '19) FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task)
A Neural Grammatical Error Correction System Built On Better Pre-training and Sequential Transfer Learning (BEA @ ACL '19) FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task), Gutenberg, Tatoeba, WikiText-103 (Pretraining)
Improving Grammatical Error Correction with Data Augmentation by Editing Latent Representation (COLING '20) FCE, NUCLE, W&I+LOCNESS, Lang-8 (BEA @ ACL '19 Shared Task)
Noising and Denoising Natural Language: Diverse Backtranslation for Grammar Correction (NAACL '18) Lang-8, CoNLL-2014, CoNLL-2013, JFLEG
Corpora Generation for Grammatical Error Correction (NAACL '19) CoNLL-2014, JFLEG, Lang-8

Dialogue

Multimodal

Mitigating Bias

Mitigating Class Imbalance

Adversarial Examples

Paper Datasets
Adversarial Example Generation with Syntactically Controlled Paraphrase Networks (NAACL '18) SST, SICK
TODO: add TextAttack

Compositionality

Paper Datasets
Good-Enough Compositional Data Augmentation (ACL '20) code SCAN
Sequence-Level Mixed Sample Data Augmentation (EMNLP '20) code SCAN

Popular Resources