This repository corresponds to our EMNLP-2021 work Robust Dialogue Utterance Rewriting as Sequence Tagging with some recent improvements, including pretraining with noisy silver data and 10-fold cross validation.
To reproduce the results from the EMNLP paper, please go directly to Regular Main Stage Training.
Silver data is first extracted from a large dialogue dataset, LCCC, for the purposes of pretraining and noisy finetuning.
There are two extracted silver datasets. The first is formed by taking the first three turns of multi-turn dialogues and replacing or dropping phrases in the third turn that also appear in the first two turns. The second contains the hardest instances of the first dataset, as measured by the edit distance between the original third turn and the edited third turn. Refer to the scripts in `silver_data_generation` to generate these two datasets.
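The following is a minimal sketch of this construction, assuming dialogues are given as lists of utterance strings; the phrase lengths, the drop/replace choice, the pronoun used as a replacement, and difflib's ratio as a stand-in for edit distance are illustrative assumptions, not the exact logic of the scripts in `silver_data_generation`.

```python
import random
from difflib import SequenceMatcher

def make_silver_example(dialogue, max_ngram=4):
    """dialogue: list of utterance strings; only the first three turns are used."""
    if len(dialogue) < 3:
        return None
    context = "".join(dialogue[:2])
    target = dialogue[2]   # original third turn (rewriting target)
    source = target        # edited third turn (model input)
    # Replace or drop phrases of the third turn that also occur in the first two turns.
    for n in range(max_ngram, 1, -1):
        for i in range(len(target) - n + 1):
            phrase = target[i:i + n]
            if phrase in context and phrase in source:
                # Drop the phrase or replace it with a generic pronoun (assumed choice).
                source = source.replace(phrase, "" if random.random() < 0.5 else "它", 1)
    if source == target:
        return None
    return {"context": dialogue[:2], "source": source, "target": target}

def hardness(example):
    # Dissimilarity between the edited and original third turns, used here as a
    # simple stand-in for edit distance: larger means a harder instance.
    return 1.0 - SequenceMatcher(None, example["source"], example["target"]).ratio()

def hardest_subset(examples, keep_ratio=0.1):
    # The hard silver dataset keeps the examples whose edited third turn
    # differs most from the original third turn.
    ranked = sorted(examples, key=hardness, reverse=True)
    return ranked[:int(len(ranked) * keep_ratio)]
```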
There are two steps in the pretraining phase, P1 and P2. In P1, the large silver dataset is used to further pretrain a language model on two tasks: query rewriting and masked language modeling. In P2, the hard silver dataset and the gold dataset are used to further finetune the base models with query rewriting as the only training task. Refer to the scripts in `pretrain_with_silver` for details.
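As a rough illustration of P1's two-task setup, the sketch below sums a query-rewriting (sequence-tagging) loss and an MLM loss in one training step; the `rewriting_loss`/`mlm_loss` methods and the mixing weight are hypothetical names for illustration, not the actual interface of the code in `pretrain_with_silver`.

```python
def p1_step(model, batch, mlm_weight=1.0):
    # Query-rewriting task: tag each token of the noised query against
    # the silver rewriting labels.
    rewrite_loss = model.rewriting_loss(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        tag_labels=batch["tag_labels"],
    )
    # Masked language modeling on the same silver utterances.
    mlm_loss = model.mlm_loss(
        input_ids=batch["masked_input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["mlm_labels"],
    )
    # P1 optimizes both tasks together; P2 would use only rewrite_loss,
    # on the hard silver data plus the gold data.
    return rewrite_loss + mlm_weight * mlm_loss
```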
To conduct training, first make the corresponding folder (e.g., `coai_p1p2`) under `experiments`. If you use the 10-fold cross validation setting, create 10 folders named `fold_x`, where `x` ranges from 1 to 10. Then modify `train.sh` before executing `./train.sh`. The training script trains only one fold at a time, so you have to manage your GPUs and launch the script 10 times.
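A hedged sketch of preparing and launching all 10 folds is shown below. Passing the fold index through an environment variable (and how `train.sh` would pick it up) is an assumption for illustration; in practice you may instead edit `train.sh` between launches, as described above.

```python
import os
import subprocess

exp_dir = "experiments/coai_p1p2"

# Create fold_1 ... fold_10 under the experiment folder.
for fold in range(1, 11):
    os.makedirs(os.path.join(exp_dir, f"fold_{fold}"), exist_ok=True)

# Launch train.sh once per fold (it trains a single fold per run).
for fold in range(1, 11):
    env = dict(os.environ, FOLD=str(fold))  # assumed: train.sh reads the fold index from $FOLD
    subprocess.run(["./train.sh"], env=env, check=True)
```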
To conduct regular main stage training, first make the corresponding folder (e.g., `coai`) under `experiments`. Then put a `params.json` into `coai`. The content of a `params.json` file does not depend on whether you run regular training or 10-fold cross validation, so you can copy an existing one (e.g., this) into `coai`. Next, check and modify `train.sh`: use the command marked with the comment `# regular training`, which basically removes `--restore_point` and sets `--fold` to empty. Finally, execute `./train.sh`.
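A minimal sketch of these regular-training steps is shown below, assuming you copy an existing `params.json` (the source path used here is only an example) and have already switched `train.sh` to the `# regular training` command.

```python
import shutil
import subprocess
from pathlib import Path

exp_dir = Path("experiments/coai")
exp_dir.mkdir(parents=True, exist_ok=True)

# Copy any existing params.json; its content does not depend on the training mode.
shutil.copy("experiments/coai_p1p2/params.json", exp_dir / "params.json")  # assumed source path

# train.sh should already use the "# regular training" command, i.e. without
# --restore_point and with --fold left empty.
subprocess.run(["./train.sh"], check=True)
```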
```bibtex
@article{hao2020robust,
  title={Robust Dialogue Utterance Rewriting as Sequence Tagging},
  author={Hao, Jie and Song, Linfeng and Wang, Liwei and Xu, Kun and Tu, Zhaopeng and Yu, Dong},
  journal={arXiv preprint arXiv:2012.14535},
  year={2020}
}
```