This project is based on the pretrained model Randeng-BART-139M-QG-Chinese.
In this project, I implemented answer separation to improve the performance of the question-generation (QG) model, following the approach suggested in the referenced paper. To further guide the model's decoding, I introduced a post-selection method that, among several generated candidates, outputs the one with the highest overlap with the given answer. Analysis of the data revealed significant differences between the training and testing sets, so I employed a data augmentation technique called "pseudo_q" to reduce this gap. The report is available here.
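A minimal sketch of the two ideas (the [ANS]/[/ANS] marker tokens, helper names, and the character-overlap scoring below are illustrative assumptions; the project's actual implementation may differ): answer separation marks the answer span in the passage before encoding, and post-selection keeps the generated candidate that overlaps most with the answer.

```python
from transformers import AutoTokenizer, BartForConditionalGeneration

MODEL = "IDEA-CCNL/Randeng-BART-139M-QG-Chinese"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = BartForConditionalGeneration.from_pretrained(MODEL)

def separate_answer(context: str, answer: str) -> str:
    # Answer separation: wrap the answer span in marker tokens so the
    # encoder sees explicitly which span the question should target.
    # "[ANS]"/"[/ANS]" are illustrative, not necessarily the project's markers.
    return context.replace(answer, f"[ANS]{answer}[/ANS]", 1)

def post_select(candidates: list[str], answer: str) -> str:
    # Post-selection: rank candidates by character overlap with the answer
    # and return the best one (a simple stand-in for the overlap metric).
    return max(candidates, key=lambda q: len(set(q) & set(answer)))

context = "北京是中华人民共和国的首都。"
answer = "北京"
inputs = tokenizer(separate_answer(context, answer), return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=5,
                         max_length=25)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(post_select(candidates, answer))
```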
cd NLPDL_final
mkdir pretrained_model
mkdir saved_model
mkdir output
# Download the pretrained QG model from
# https://huggingface.co/IDEA-CCNL/Randeng-BART-139M-QG-Chinese
# and place it in pretrained_model/bart_zh
# Download the pretrained QA model from
# https://huggingface.co/uer/roberta-base-chinese-extractive-qa
# and place it in pretrained_model/QA
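Both checkpoints can also be fetched programmatically (a sketch assuming a recent version of the huggingface_hub library; manual download works just as well):

```python
from huggingface_hub import snapshot_download

# Download each repository into the directory layout expected above.
snapshot_download(repo_id="IDEA-CCNL/Randeng-BART-139M-QG-Chinese",
                  local_dir="pretrained_model/bart_zh")
snapshot_download(repo_id="uer/roberta-base-chinese-extractive-qa",
                  local_dir="pretrained_model/QA")
```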
# Preprocess the Chinese SQuAD data
python data_preprocess/preprocess_squad_zh.py
# Build the pseudo_q augmentation data
python data_preprocess/process_pseudo_q.py
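The script names suggest the data follows the standard SQuAD schema; for reference, one record in that schema looks like this (whether the scripts emit exactly these fields is an assumption):

```python
# One SQuAD-style record: a passage, a question about it, and the answer
# span (text plus character offset into the context).
example = {
    "context": "北京是中华人民共和国的首都。",
    "question": "中华人民共和国的首都是哪里?",
    "answers": {"text": ["北京"], "answer_start": [0]},
}
```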
You can create your own training-arguments JSON file to train or evaluate the model. An example is shown below (the # comments are explanatory only and must be removed from an actual JSON file):
# train_args
{
"do_train":true, # set to true when train
"do_predict":false, # set to true when eval
"output_dir":"saved_model",
"save_strategy":"epoch",
"save_total_limit":5,
"num_train_epochs":5,
"per_device_train_batch_size":16,
"gradient_accumulation_steps":4,
"logging_steps":10,
"learning_rate":3e-04,
"seed":42,
"ans_attn":"none",
"load_pretrain":1, # whether to load the pretrained model or train from scratch
"low_res":100, # low resource senario
"aug":"none", # data augmentation methods from [qa, squad, pseudo_q]
"sep_ans":1, # whether to use answer separation
"model_type":"as", # use bart model or ans-stressed-attention bart model
"load_path":"", # provided for eval
"generation_max_length":25,
"predict_with_generate":true
}
# train
python train.py your_train_args.json
# evaluate
python train.py your_eval_args.json
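For reference, here is one common way such a JSON file can be consumed. This is a sketch assuming train.py uses transformers' HfArgumentParser; the ExtraArgs dataclass below is hypothetical, with field names mirroring the example above:

```python
from dataclasses import dataclass
from transformers import HfArgumentParser, Seq2SeqTrainingArguments

@dataclass
class ExtraArgs:
    # Project-specific options; defaults here are assumptions.
    ans_attn: str = "none"
    load_pretrain: int = 1
    low_res: int = 100
    aug: str = "none"
    sep_ans: int = 1
    model_type: str = "as"
    load_path: str = ""

# Standard HuggingFace keys (do_train, learning_rate, ...) land in
# Seq2SeqTrainingArguments; the remaining keys land in ExtraArgs.
parser = HfArgumentParser((ExtraArgs, Seq2SeqTrainingArguments))
extra_args, training_args = parser.parse_json_file("your_train_args.json")
```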