Learning from Multi-Source Weak Supervision for Deep Text Classification
Python 3.6
The code can be run in either a CPU or a GPU environment.
To run the model, first unzip the dataset file.
(Note: due to the GitHub space limitation, only three datasets are included in this repository. The entire dataset can be downloaded from: https://drive.google.com/drive/folders/1MJe1BJYNPudfmpFxCeHwYqXMx53Kv4h_?usp=sharing)
Then run the model in either of the following two ways:
(1) python main_conditional_attn.py --ds {$dataset}
(For example: python main_conditional_attn.py --ds imdb)
usage: main_conditional_attn.py [-h] [--pt_file PT_FILE] --ds
                                {youtube,imdb,yelp,agnews,spouse} [--no_cuda]
                                [--fast_mode] [--seed SEED] [--epoch EPOCH]
                                [--lr LR] [--weight_decay WEIGHT_DECAY]
                                [--hidden HIDDEN] [--c2 C2] [--c3 C3]
                                [--k K] [--x0 X0]
                                [--unlabeled_ratio UNLABELED_RATIO]
                                [--log_prefix LOG_PREFIX] [--ft_log FT_LOG]
                                [--n_high_cov N_HIGH_COV]
(2) sh run.sh
The trained model is stored in the model folder.
Detailed run output can be found in the log_files folder.
The test accuracy can be found in the ft_logs folder.
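To run several datasets in a row without editing run.sh, the command from option (1) can be assembled in a short script. The following is a minimal sketch, not part of the repository: the helper name build_command and the seed value 0 are hypothetical, while the script name, the dataset names, and the --ds, --seed, and --no_cuda flags are taken from the usage message above.

```python
# Dataset names copied from the --ds choices in the usage message.
DATASETS = ["youtube", "imdb", "yelp", "agnews", "spouse"]


def build_command(dataset, seed=None, no_cuda=False):
    """Return the argument list for one run of main_conditional_attn.py.

    build_command is a hypothetical helper for illustration only.
    """
    if dataset not in DATASETS:
        raise ValueError("unknown dataset: {!r}".format(dataset))
    cmd = ["python", "main_conditional_attn.py", "--ds", dataset]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if no_cuda:
        # Force a CPU run even when a GPU is available.
        cmd.append("--no_cuda")
    return cmd


if __name__ == "__main__":
    for name in DATASETS:
        # Only print the command here; pass the list to subprocess.run(...)
        # to actually launch training.
        print(" ".join(build_command(name, seed=0)))
```

Printing the commands first makes it easy to confirm that only downloaded datasets are requested before any training is started.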
Dataset:
- agnews
- imdb
- spouse
- yelp
- youtube
The required data are stored as *.pt files. Each record includes the following information:
- the original document text ('text')
- the extracted pre-trained Transformer feature ('bert_feature')
- the ground truth label ('label')
- the annotated noisy labels ('lf')
- the simple majority voting label of annotated noisy labels ('major_label')
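To illustrate how a 'major_label' value relates to the 'lf' votes of one record, here is a minimal sketch of simple majority voting. It is not the code used to build the datasets. It assumes that a labeling function which does not fire emits -1 and that ties are broken toward the smallest label; both conventions are assumptions made for this example.

```python
from collections import Counter

ABSTAIN = -1  # assumption: value emitted when a labeling function does not fire


def majority_vote(lf_votes, abstain=ABSTAIN):
    """Return the most frequent non-abstain vote for one record.

    Returns `abstain` if no labeling function voted. Ties are broken
    toward the smallest label, an arbitrary choice for this sketch.
    """
    votes = [v for v in lf_votes if v != abstain]
    if not votes:
        return abstain
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


if __name__ == "__main__":
    # Two functions vote 1, one votes 0, one abstains.
    print(majority_vote([1, 0, 1, -1]))
```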
We use a dictionary to store the training, validation, and test data. The same split is used for all the baselines as well.
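data_dict = {
    'labeled': {
        'text': ...,          # original document text
        'label': ...,         # ground truth label
        'major_label': ...,   # majority vote over the noisy labels
        'lf': ...,            # annotated noisy labels
        'bert_feature': ...,  # pre-trained Transformer feature
    },
    'unlabeled': {
        'text': ...,
        'label': ...,
        'major_label': ...,
        'lf': ...,
        'bert_feature': ...,
    },
    'test': {
        'text': ...,
        'label': ...,
        'major_label': ...,
        'lf': ...,
        'bert_feature': ...,
    },
    'validation': {
        'text': ...,
        'label': ...,
        'major_label': ...,
        'lf': ...,
        'bert_feature': ...,
    },
}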
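Before training on a new or re-downloaded file, it can help to check that it follows the layout above. The following is a minimal sketch, not part of the repository: check_data_dict is a hypothetical helper, and it assumes that every field holds one entry per record so that all fields in a split have the same length. A real file would presumably be read with torch.load, given the *.pt extension, but that is also an assumption; the sketch uses a small hand-made dictionary instead.

```python
SPLITS = ("labeled", "unlabeled", "test", "validation")
FIELDS = ("text", "label", "major_label", "lf", "bert_feature")


def check_data_dict(data_dict):
    """Return a list of layout problems; an empty list means none were found."""
    problems = []
    for split in SPLITS:
        if split not in data_dict:
            problems.append("missing split: {}".format(split))
            continue
        part = data_dict[split]
        for field in FIELDS:
            if field not in part:
                problems.append("{}: missing field: {}".format(split, field))
        # Assumption: each field stores one entry per record.
        lengths = {f: len(part[f]) for f in FIELDS if f in part}
        if len(set(lengths.values())) > 1:
            problems.append("{}: field lengths differ: {}".format(split, lengths))
    return problems


def make_toy_dict():
    """Build a tiny dictionary with made-up values in the documented layout."""
    return {
        split: {
            "text": ["first doc", "second doc"],
            "label": [0, 1],
            "major_label": [0, 1],
            "lf": [[0, -1], [1, 1]],
            "bert_feature": [[0.1, 0.2], [0.3, 0.4]],
        }
        for split in SPLITS
    }


if __name__ == "__main__":
    print(check_data_dict(make_toy_dict()))
```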
We provide the labeling functions and labeling results for each dataset in the rules-noisy-labels folder. A detailed description is given in the README inside that folder.