Code for the paper "Is BERT Robust to Label Noise? A Study on Learning with Noisy Labels in Text Classification".
We run our experiments on the following four datasets: AG-News, IMDB, Yorùbá, and Hausa.
| Name   | # train | # validation | # test | Data source |
|--------|---------|--------------|--------|-------------|
| AGNews | 108,000 | 12,000       | 7,600  | LINK        |
| IMDB   | 21,246  | 3,754        | 2,500  | LINK        |
| Yorùbá | 1,340   | 189          | 379    | LINK        |
| Hausa  | 2,045   | 290          | 582    | LINK        |
Create a `data` directory as the home directory for all datasets, then create a folder for each dataset inside it. For each dataset, create the following files:

- `[train/validation/test].txt` files: one document per line.
- `[train/validation/test]_labels.pickle` files: a list of labels (we assume labels are already encoded as label ids, e.g., `[0, 0, 1, 2, 3, ..., 4]`).
- If there is no validation data for a given dataset, there is no need to create the corresponding validation files.
- Save the files in a `txt_data` directory inside the dataset folder. For example, the full path to the `train.txt` file of the AGNews dataset is `data/AGNews/txt_data/train.txt`.
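The layout above can be produced with a short script. This is a minimal sketch, assuming your documents and label ids are already in memory; `save_split` is a hypothetical helper, not part of the repository.

```python
import os
import pickle

# Hypothetical helper: write one split of a dataset into the layout
# described above (txt_data/<split>.txt and txt_data/<split>_labels.pickle).
def save_split(data_root, dataset, split, documents, label_ids):
    out_dir = os.path.join(data_root, dataset, "txt_data")
    os.makedirs(out_dir, exist_ok=True)
    # One document per line, so newlines inside a document must be removed.
    with open(os.path.join(out_dir, f"{split}.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(doc.replace("\n", " ") for doc in documents))
    # Labels are stored as a plain pickled list of integer label ids.
    with open(os.path.join(out_dir, f"{split}_labels.pickle"), "wb") as f:
        pickle.dump(label_ids, f)

# Example: two toy AGNews training documents with label ids 0 and 2.
save_split("data", "AGNews", "train",
           ["first document", "second document"], [0, 2])
```

Run the same call once per split (`train`, `validation`, `test`) and per dataset.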
Run BERT-WN on AG-News with 20% single-flip label noise:
```shell
CUDA_VISIBLE_DEVICES=[CUDA_ID] python3 ../main.py \
    --dataset AG-News \
    --log_root [LOG_ROOT] \
    --data_root [DATA_ROOT] \
    --trainer_name bert_wn \
    --model_name bert-base-uncased \
    --gen_val \
    --nl_batch_size 32 \
    --eval_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --max_sen_len 64 \
    --lr 0.00002 \
    --num_training_steps 3000 \
    --patience 25 \
    --eval_freq 50 \
    --store_model [STORE_MODEL] \
    --noise_level 0.2 \
    --noise_type sflip \
    --manualSeed 1234
```
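For intuition about what `--noise_type sflip --noise_level 0.2` does to the training labels, here is a minimal sketch of single-flip noise: each class is corrupted to one fixed target class with probability equal to the noise level. The cyclic flip map used here (`c → c+1 mod num_classes`) is an illustrative assumption, not necessarily the exact map in `main.py`.

```python
import random

def single_flip(labels, num_classes, noise_level, seed=1234):
    """Flip each label to one fixed other class with prob. `noise_level`."""
    rng = random.Random(seed)
    # Assumed flip map: each class is sent to the next class, cyclically.
    flip_to = {c: (c + 1) % num_classes for c in range(num_classes)}
    return [flip_to[y] if rng.random() < noise_level else y for y in labels]

clean = [0, 1, 2, 3] * 250          # 1000 toy labels over 4 classes
noisy = single_flip(clean, num_classes=4, noise_level=0.2)
frac = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
print(f"fraction flipped: {frac:.2f}")  # close to the 20% noise level
```

Because each class has exactly one flip target, single-flip (asymmetric) noise is harder than uniform noise at the same level: the corrupted labels are systematically biased toward specific classes rather than spread evenly.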