Selective Masking

Source code for "Train No Evil: Selective Masking for Task-Guided Pre-Training"

Download Data

The datasets can be downloaded from https://cloud.tsinghua.edu.cn/d/214217e068c543cd8116/. Place the downloaded datasets in data/datasets.

Run the Whole Pipeline

Modify config/test.json to set the input path, output path, BERT model path, GPU usage, etc.
Then run: bash scripts/run_all_pipeline.sh
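
For readers who prefer to drive the individual steps from Python, the commands listed under the per-step section of this README can be chained as in the sketch below. This is not the content of scripts/run_all_pipeline.sh; it only assumes that the listed commands are run in order from the repository root, and that a failing step should stop the run. The names PIPELINE_STEPS and run_pipeline are hypothetical. General pre-training (step 1) is left out because it relies on NVIDIA's external scripts, and gather_results.py is left out because it needs the fine-tune output path as an argument.

```python
import subprocess

# Commands copied from this README, in the order they are listed.
# The step labels are only used for logging.
PIPELINE_STEPS = [
    ("convert config", ["python3", "convert_config.py", "config/test.json"]),
    ("2.1 finetune BERT", ["bash", "scripts/finetune_origin.sh"]),
    ("2.2 downstream mask", ["bash", "data/create_data_rule/run.sh"]),
    ("2.3 train NN", ["bash", "scripts/run_mask_model.sh"]),
    ("2.4 in-domain mask", ["bash", "data/create_data_model/run.sh"]),
    ("3 TaskPT", ["bash", "scripts/run_pretraining.sh"]),
    ("4 fine-tune", ["bash", "scripts/finetune_ckpt_all_seed.sh"]),
]


def run_pipeline(steps=PIPELINE_STEPS, dry_run=True):
    """Process each step in order and return the names of the steps handled.

    With dry_run=True the commands are only printed, never executed.
    With dry_run=False the first failing command raises CalledProcessError,
    so later steps do not run on top of a broken earlier step.
    """
    done = []
    for name, cmd in steps:
        print(f"[{name}] {' '.join(cmd)}")
        if not dry_run:
            # check=True turns a non-zero exit code into an exception.
            subprocess.run(cmd, check=True)
        done.append(name)
    return done


if __name__ == "__main__":
    # Dry run by default: prints the plan without launching any training.
    run_pipeline(dry_run=True)
```

Calling run_pipeline(dry_run=False) from the repository root would execute the steps for real; the dry run is the default so that the plan can be inspected first.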

Run Each Step

Each step is explained in the appendix of our paper. The input and output paths of every step are also set in config/test.json. Run python3 convert_config.py config/test.json to convert the .json file to a .sh file.
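
To illustrate what a .json to .sh conversion of this kind can look like, here is a minimal sketch. It is not the repository's convert_config.py: the function name json_to_shell, the upper-casing of keys, and the example keys are all assumptions made for illustration, and the real config/test.json may use different keys and a nested layout. The sketch assumes a flat JSON object whose keys are valid shell identifiers.

```python
import json
import shlex


def json_to_shell(config: dict) -> str:
    """Render a flat dict as shell variable assignments, one per line."""
    lines = []
    for key, value in config.items():
        name = key.upper()  # assumption: keys are valid shell identifiers
        if isinstance(value, bool):
            text = "1" if value else "0"
        elif isinstance(value, (list, tuple)):
            # Lists (e.g. GPU ids) are joined with commas.
            text = ",".join(str(v) for v in value)
        else:
            text = str(value)
        # shlex.quote protects values containing spaces or shell metacharacters.
        lines.append(f"{name}={shlex.quote(text)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical keys, not the real contents of config/test.json.
    example = json.loads(
        '{"output_dir": "out/my run", "gpu_list": [0, 1], "fp16": true}'
    )
    print(json_to_shell(example), end="")
```

A file produced this way could be loaded by the step scripts with the shell's source command, which is one common reason to convert a JSON config into a .sh file.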

1 GenePT

We use the training scripts from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT for general pre-training.

2 Selective Masking

2.1 Fine-tune BERT

bash scripts/finetune_origin.sh

2.2 Downstream Mask

bash data/create_data_rule/run.sh

2.3 Train NN

bash scripts/run_mask_model.sh

2.4 In-domain Mask

bash data/create_data_model/run.sh

3 TaskPT

bash scripts/run_pretraining.sh

4 Fine-tune

bash scripts/finetune_ckpt_all_seed.sh
python3 gather_results.py $PATH_TO_THE_FINETUNE_OUTPUT
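
When fine-tuning is repeated over several random seeds, the per-seed scores are usually summarized by their mean and standard deviation. The sketch below shows that summary only; it is not the repository's gather_results.py, and it makes no claim about the format of the fine-tune output. The function name summarize_seeds and the dictionary input (seed mapped to score) are hypothetical.

```python
import statistics


def summarize_seeds(scores_by_seed: dict) -> dict:
    """Return the count, mean, and sample standard deviation of per-seed scores."""
    scores = list(scores_by_seed.values())
    if not scores:
        raise ValueError("no scores to summarize")
    mean = statistics.mean(scores)
    # The sample standard deviation needs at least two values.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {"n": len(scores), "mean": mean, "std": std}


if __name__ == "__main__":
    # Placeholder seeds and scores for illustration only, not real results.
    summary = summarize_seeds({1: 80, 2: 90})
    print(f"n={summary['n']} mean={summary['mean']:.2f} std={summary['std']:.2f}")
```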

Cite

If you use the code, please cite this paper:

@inproceedings{gu2020train,
    title={Train No Evil: Selective Masking for Task-Guided Pre-Training},
    author={Yuxian Gu and Zhengyan Zhang and Xiaozhi Wang and Zhiyuan Liu and Maosong Sun},
    year={2020},
    booktitle={Proceedings of EMNLP 2020},
}
