Code for the paper "Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond"
Requirements:
- Python 3.7.7
- PyTorch 1.4.0
- Transformers 3.1.0
- NLTK (stopwords, wordnet)
- NVIDIA Apex (fp16 training)
Install the required packages:
```
pip install -r requirements.txt
git clone https://github.com/NVIDIA/apex.git
pip install -v --disable-pip-version-check --no-cache-dir ./apex
```
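To sanity-check the Apex installation (an optional step; `apex.amp` is the module used for fp16 training):
```
python -c "from apex import amp; print('Apex AMP available')"
```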
Install Snippext:
```
git clone -b v1.0 https://github.com/rit-git/Snippext_public.git
cp -r Snippext_public/snippext .
```
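Optionally, verify that the copied `snippext` package is importable from the repository root:
```
python -c "import snippext"
```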
To train a model with Rotom:
```
CUDA_VISIBLE_DEVICES=0 python train_any.py \
  --task em_DBLP-ACM \
  --size 300 \
  --logdir results_em/ \
  --finetuning \
  --batch_size 32 \
  --lr 3e-5 \
  --n_epochs 20 \
  --max_len 128 \
  --fp16 \
  --lm roberta \
  --da auto_filter_weight \
  --balance \
  --run_id 0
```
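Training metrics are written to the `--logdir` directory, so a run can be monitored with standard TensorBoard:
```
tensorboard --logdir results_em/
```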
The current version supports three tasks: entity matching (EM), error detection (EDT), and text classification (TextCLS). The supported task names are:

| Type | Dataset Names | Taskname Pattern |
|---|---|---|
| EM | Abt-Buy, Amazon-Google, DBLP-ACM, DBLP-GoogleScholar, Walmart-Amazon | `em_{dataset}`, e.g., `em_DBLP-ACM` |
| EDT | beers, hospital, movies, rayyan, tax | `cleaning_{dataset}`, e.g., `cleaning_beers` |
| TextCLS | AG, AMAZON2, AMAZON5, ATIS, IMDB, SNIPS, SST-2, SST-5, TREC | `textcls_{dataset}`, e.g., `textcls_AG` |
| TextCLS, splits from Hu et al. | IMDB, SST-5, TREC | `compare1_{dataset}`, e.g., `compare1_IMDB` |
| TextCLS, splits from Kumar et al. | ATIS, SST-2, TREC | `compare2_{dataset}`, e.g., `compare2_ATIS` |
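For example, an EDT run follows the same pattern as the EM command above (a sketch: the `--logdir` value here is an arbitrary choice, and the size 100 is taken from the EDT range listed under `--size` below):
```
CUDA_VISIBLE_DEVICES=0 python train_any.py \
  --task cleaning_beers \
  --size 100 \
  --logdir results_cleaning/ \
  --finetuning \
  --batch_size 32 \
  --lr 3e-5 \
  --n_epochs 20 \
  --max_len 128 \
  --fp16 \
  --lm roberta \
  --da auto_filter_weight \
  --balance \
  --run_id 0
```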
Parameters:
- `--task`: the taskname pattern specified following the above table
- `--size`: the dataset size (optional). If not specified, the entire dataset will be used. The size ranges are {300, 450, 600, 750} for EM, {50, 100, 150, 200} for EDT, and {100, 300, 500} for TextCLS
- `--logdir`: the path for TensorBoard logging (F1, accuracy, precision, and recall)
- `--finetuning`: always keep this flag on
- `--batch_size`, `--lr`, `--max_len`, `--n_epochs`: the batch size, learning rate, max sequence length, and number of epochs for model training
- `--fp16`: whether to use half-precision for training
- `--lm`: the language model to fine-tune. We currently support `bert`, `distilbert`, and `roberta`
- `--balance`: a special option for binary classification (EM and EDT) with skewed labels (#positive labels << #negative labels). If this flag is on, the training process up-samples the positive labels
- `--warmup`: (new) if this flag is on together with SSL, the model is first warmed up on the labeled data only before running SSL. Only supported for EM for now
- `--run_id`: the integer ID of the run, e.g., {0, 1, 2, ...}
- `--da`: the data augmentation method (see the table below)
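Since results vary across random initializations, a common pattern (a sketch, not a script shipped with the repo) is to repeat a configuration under several `--run_id` values and average the logged metrics:
```
for run_id in 0 1 2; do
  CUDA_VISIBLE_DEVICES=0 python train_any.py \
    --task em_DBLP-ACM --size 300 --logdir results_em/ \
    --finetuning --batch_size 32 --lr 3e-5 --n_epochs 20 \
    --max_len 128 --fp16 --lm roberta \
    --da auto_filter_weight --balance --run_id $run_id
done
```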
| Method | Operator Name(s) |
|---|---|
| No DA (plain LM fine-tuning) | `None` |
| Regular transformation-based DA | `['del', 'drop_col', 'append_col', 'swap', 'ins']` for EM/EDT; `['del', 'token_del_tfidf', 'token_del', 'shuffle', 'token_repl', 'token_repl_tfidf']` for TextCLS |
| Inverse DA (InvDA) | `t5` / `invda` |
| Rotom (w/o semi-supervised learning) | `auto_filter_weight_no_ssl` |
| Rotom (w/ semi-supervised learning) | `auto_filter_weight` |
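To train with a single transformation-based operator instead of Rotom's meta-learned policy, the operator name from the table above is presumably passed directly to `--da`; for example, for text classification (the `--logdir` value is an arbitrary choice):
```
CUDA_VISIBLE_DEVICES=0 python train_any.py \
  --task textcls_AG --size 300 --logdir results_textcls/ \
  --finetuning --batch_size 32 --lr 3e-5 --n_epochs 20 \
  --max_len 128 --fp16 --lm roberta \
  --da token_repl --run_id 0
```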
For the InvDA fine-tuning, see `invda/README.md`.
All experiment scripts are available in `scripts/`. To run the experiments for a task (em, cleaning, or textcls):
```
python scripts/run_all_em.py
```
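The command above runs the EM experiments; assuming the same naming convention, the cleaning and text classification experiments would be launched analogously (script names inferred from the pattern, so check the contents of `scripts/` for the exact file names):
```
python scripts/run_all_cleaning.py
python scripts/run_all_textcls.py
```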