Source code for the NLPCC 2021 paper 《面向问答领域的数据增强方法》 (Data Augmentation Method for Question Answering).
pip install -r requirements.txt # python>=3.6
Note: requirements.txt may be incomplete; install any missing packages as needed.
The raw_data and unilm directories are too large (~200 MB) to upload here; download them from https://pan.baidu.com/s/1XL1jOlv4v5UfYHVnVZnMWg (password: nfr9).
- Train the QA model
# train the BERT QA model and get predictions; set the parameters inside the script
python bert_qa.py
When training finishes, set test_file to train_file and run prediction again; the model's outputs on the training set are used for data augmentation later.
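A minimal sketch of that rerun, assuming hypothetical parameter names inside bert_qa.py (the script's actual variable names and paths may differ):

```python
# Hypothetical parameter block inside bert_qa.py; names and paths are
# illustrative, not necessarily the script's actual ones.
train_file = "raw_data/attack/train.json"
test_file = "raw_data/attack/dev.json"  # normal setting: predict on the dev set

# For the augmentation step, point test_file at the training set so the
# model writes out a prediction for every training example.
test_file = train_file
```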
- Get wrong predictions and generate questions
# cd aug_data/qa_files/xxx, where xxx is 'attack'
# select the wrong predictions into aug_data/filter_files/xxx/train.select_wrong.txt
python filter.py
# cd unilm/src
# download the pretrained model from https://unilm.blob.core.windows.net/ckpt/unilm1-large-cased.bin (with the bert-large-cased config) to unilm/pretrained_data/unilm1_large_cased
# download the checkpoint from https://drive.google.com/open?id=1JN2wnkSRotwUnJ_Z-AbWwoPdP53Gcfsn to unilm/recent_model/microsoft_release
# generate questions from the selected wrong predictions; the output is saved as unilm/recent_model/microsoft_release/qg_model.bin.train
bash job.sh
# manually move qg_model.bin.train to aug_data/qg_files/xxx/unilm.preds.txt
# cd aug_data/qa_files/attack
# convert the data format for QAMatch
python rebuilt.py --switch 0
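The exact selection logic of filter.py is not spelled out above; as a rough sketch of the idea, keep the training examples the QA model answered incorrectly. SQuAD-style gold data and a {question_id: predicted_answer} JSON are assumed here, and all file names and fields are illustrative:

```python
import json

# Illustrative sketch only, not the actual filter.py: select training
# examples where the QA model's prediction misses every gold answer.
with open("train.json") as f:
    gold = json.load(f)
with open("train_predictions.json") as f:  # assumed {qas_id: answer_text}
    preds = json.load(f)

wrong = []
for article in gold["data"]:
    for para in article["paragraphs"]:
        for qa in para["qas"]:
            gold_answers = {a["text"].strip().lower() for a in qa["answers"]}
            pred = preds.get(qa["id"], "").strip().lower()
            if pred not in gold_answers:  # the model got this example wrong
                wrong.append({"context": para["context"],
                              "question": qa["question"],
                              "answers": qa["answers"]})

# one JSON object per line, mirroring train.select_wrong.txt
with open("train.select_wrong.txt", "w") as f:
    for ex in wrong:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```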
- Filter with QAMatch/roundTrip
# first, train the QAMatch model
# follow raw_data/README.md to get the data
# cd QAMatch
# QAMatch pretrained model: https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad; set this in QAMatch/train.py before training
python train.py  # uses SQuAD v1 by default; you can change it inside the script
# cd aug_data/qa_files/attack
# convert the data format for QAMatch
python for_textMatch.py
# cd QAMatch
# use the QAMatch model to judge the generated questions
python test.py
# cd aug_data/qa_files/attack
python rebuilt.py --switch 1
# finally, you get textMatch/train.after_quality.json; mix it with train.json into train.mix.after_quality.json and retrain the QA model on it
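As a rough sketch of that final mix-up step, assuming both files are SQuAD-style JSON (the merge below is illustrative; the repo's own scripts may do this differently):

```python
import json

# Illustrative sketch only: concatenate the quality-filtered augmented
# data with the original training set and write the mixed file.
with open("train.json") as f:
    original = json.load(f)
with open("textMatch/train.after_quality.json") as f:
    augmented = json.load(f)

mixed = {
    "version": original.get("version", "1.1"),
    "data": original["data"] + augmented["data"],  # simple concatenation
}

with open("train.mix.after_quality.json", "w") as f:
    json.dump(mixed, f, ensure_ascii=False)
```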