
KGPT: Knowledge-Grounded Pre-Training

Code and Data for the EMNLP 2020 Paper KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation. The paper proposes a distantly supervised pre-training algorithm for two general data-to-text architectures: 1) Sequence KGPT and 2) Graph KGPT. Both models can be applied to a wide range of data-to-text generation tasks. We crawl 7 million distantly supervised data-to-text pairs from Wikipedia to pre-train the generation model and then fine-tune it on downstream tasks. The fine-tuned model achieves SOTA on multiple datasets, and the improvements under the few-shot setting are especially dramatic.
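To illustrate the task format, here is a minimal sketch of how a sequence-style model might linearize knowledge triples into one token stream before encoding. This is not the repo's actual preprocessing code; the special tokens and example triples are hypothetical.

```python
# Hypothetical sketch of knowledge linearization for a sequence encoder.
# The [ENT]/[PRED]/[OBJ] markers are illustrative, not KGPT's real vocabulary.
def linearize_triples(triples):
    """Flatten (subject, predicate, object) triples into one string."""
    parts = []
    for subj, pred, obj in triples:
        parts.append(f"[ENT] {subj} [PRED] {pred} [OBJ] {obj}")
    return " ".join(parts)

triples = [
    ("Alan Bean", "occupation", "astronaut"),
    ("Alan Bean", "birthPlace", "Wheeler, Texas"),
]
print(linearize_triples(triples))
```

The graph variant instead keeps the triple structure explicit in the encoder's attention, rather than flattening it into one sequence.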

Sequence Encoder:

Graph Encoder:

Requirements:

Download Preprocessed Dataset

wget https://kgpt.s3-us-west-2.amazonaws.com/dataset.zip
unzip dataset.zip

If you want to do pre-training, please download the WikiData graph as well:

wget https://kgpt.s3-us-west-2.amazonaws.com/preprocess.zip
unzip preprocess.zip

Download Pre-trained KGPT model

wget https://kgpt.s3-us-west-2.amazonaws.com/models.zip
unzip models.zip

Option 1: Finetune on Full Set

Finetune the model on the full downstream dataset.

Sequence Encoder

  • WebNLG
      bash scripts/webnlg/finetune_sequence_webnlg_from_wikidata.sh 0 checkpoint_wikidata/checkpoint_sequence_head8_layer6_GPT2_maxfact12/model_ep14.pt
    
  • E2ENLG
      bash scripts/e2enlg/finetune_sequence_e2enlg_from_wikidata.sh 0 checkpoint_wikidata/checkpoint_sequence_head8_layer6_GPT2_maxfact12/model_ep14.pt
    

Graph Encoder

  • WebNLG
      bash scripts/webnlg/finetune_graph_webnlg_from_wikidata.sh 0 checkpoint_wikidata/checkpoint_graph_head8_layer6_GPT2_maxfact12/model_ep14.pt
    
  • E2ENLG
      bash scripts/e2enlg/finetune_graph_e2enlg_from_wikidata.sh 0 checkpoint_wikidata/checkpoint_graph_head8_layer6_GPT2_maxfact12/model_ep14.pt
    

Option 2: Finetune for Few-Shot Learning

Finetune the model on 1% of the downstream dataset.

  • WebNLG
      bash scripts/webnlg/finetune_sequence_webnlg_from_wikidata_fewshot.sh 0 checkpoint_wikidata/checkpoint_sequence_head8_layer6_GPT2_maxfact12/model_ep14.pt 0.01
    
  • E2ENLG
      bash scripts/e2enlg/finetune_sequence_e2enlg_from_wikidata_fewshot.sh 0 checkpoint_wikidata/checkpoint_sequence_head8_layer6_GPT2_maxfact12/model_ep14.pt 0.01
    
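The trailing 0.01 argument is the fraction of training data used. A minimal sketch of what such a subsampling step might look like (the repo's actual sampling logic may differ):

```python
import random

def few_shot_subset(examples, fraction, seed=0):
    """Deterministically sample a fraction of the training examples."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)

train = [f"example-{i}" for i in range(1000)]
subset = few_shot_subset(train, 0.01)
print(len(subset))  # 10
```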

Model selection

Evaluate all the saved models on the validation set to select the best model.

Sequence Encoder

  bash scripts/webnlg/eval_sequence_webnlg_all.sh 0 test checkpoint_webnlg/checkpoint_finetune_sequence_head8_layer6_GPT2_maxfact12/

Graph Encoder

  bash scripts/webnlg/eval_graph_webnlg_all.sh 0 test checkpoint_webnlg/checkpoint_finetune_graph_head8_layer6_GPT2_maxfact12/
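Conceptually, this selection step boils down to picking the checkpoint with the highest validation metric. A sketch with made-up scores:

```python
def best_checkpoint(scores):
    """scores: dict mapping checkpoint filename -> validation metric."""
    return max(scores, key=scores.get)

# Illustrative validation scores, not real numbers from the paper.
scores = {
    "model_ep10.pt": 0.61,
    "model_ep20.pt": 0.66,
    "model_ep30.pt": 0.64,
}
print(best_checkpoint(scores))  # model_ep20.pt
```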

Final test

For example, if the model at the 20th epoch achieves the best score, output predictions using the following command.

Sequence Encoder

  bash scripts/webnlg/eval_sequence_webnlg.sh 0 challenge checkpoint_webnlg/checkpoint_finetune_sequence_head8_layer6_GPT2_maxfact12/model_ep20.pt

Graph Encoder

  bash scripts/webnlg/eval_graph_webnlg.sh 0 challenge checkpoint_webnlg/checkpoint_finetune_graph_head8_layer6_GPT2_maxfact12/model_ep20.pt

Evaluation

We use the standard E2E evaluation pipeline:

  git clone https://github.com/wenhuchen/Data-to-text-Evaluation-Metric.git
  cd Data-to-text-Evaluation-Metric
  ./measure_scores.py ../dataset/webnlg/test.txt ../checkpoint_webnlg/checkpoint_finetune_graph_head8_layer6_GPT2_maxfact12/model_ep20.txt
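Before running the metric script, it can help to sanity-check that the prediction file pairs up line-for-line with the references. A small sketch (not part of the official pipeline), demonstrated on throwaway temp files rather than the real paths above:

```python
import os
import tempfile

def files_aligned(ref_path, hyp_path):
    """Return True if both files contain the same number of lines."""
    with open(ref_path) as f:
        n_ref = sum(1 for _ in f)
    with open(hyp_path) as f:
        n_hyp = sum(1 for _ in f)
    return n_ref == n_hyp

# Demo on throwaway files standing in for test.txt / model_ep20.txt.
tmp = tempfile.mkdtemp()
ref = os.path.join(tmp, "test.txt")
hyp = os.path.join(tmp, "model_ep20.txt")
with open(ref, "w") as f:
    f.write("reference one\nreference two\n")
with open(hyp, "w") as f:
    f.write("hypothesis one\nhypothesis two\n")
print(files_aligned(ref, hyp))  # True
```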

Reproducing our results

We have released our fine-tuned models on Google Drive. Simply run the following command to generate the decoded text file, which replicates the scores reported in the paper.

bash scripts/webnlg/eval_sequence_webnlg.sh 0 test checkpoint_webnlg/checkpoint_finetune_sequence_head8_layer6_GPT2_maxfact12_from_ep14/model_ep30.pt

Pre-training

If you want to pre-train the model yourself, please prepare as many GPUs as you can. Our project uses 8 TITAN RTX GPUs (24 GB memory each) and pre-trains on KGText with a batch size of 128 for roughly 10 days. Pre-training can be started with the following command:

  bash scripts/wikidata/train_sequence_wikidata_pretraining.sh 0,1,2,3,4,5,6,7

The best performance is normally achieved between the 8th and 14th epochs. The model uses the default setting of 6 layers with 8 heads.
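Collecting the defaults implied by the checkpoint names and the text above (head8, layer6, GPT-2 decoder, maxfact12) into one place. The key names below are hypothetical, not the repo's actual argument names:

```python
# Hypothetical summary of the default pre-training setup; key names are
# illustrative, not the training script's real flags.
PRETRAIN_CONFIG = {
    "encoder_layers": 6,      # "layer6" in checkpoint names
    "attention_heads": 8,     # "head8" in checkpoint names
    "decoder": "GPT2",        # GPT-2 based decoder
    "max_facts": 12,          # "maxfact12" in checkpoint names
    "batch_size": 128,
    "num_gpus": 8,            # 8x TITAN RTX, 24 GB each
}
print(sorted(PRETRAIN_CONFIG))
```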

Citation

If you find this project useful, please cite it using the following format:

  @article{chen2020kgpt,
    title={KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation},
    author={Chen, Wenhu and Su, Yu and Yan, Xifeng and Wang, William},
    journal={Proceedings of EMNLP 2020},
    year={2020}
  }

Q&A

If you have any questions about the paper or the repository, please feel free to open an issue or send me an email.