The Pytorch implementation of SGSH: Stimulate Large Language Models with Skeleton Heuristics for Knowledge Base Question Generation(NAACL 2024 Findings).
- Create a virtual environment by running the following command:
$ conda env create --name=SGSH --file=environment.yml
- Activate the environment using:
$ conda activate SGSH
Our experiments contain two widely-used datasets, i.e., WebQuestions (WQ) and PathQuestions (PQ). The raw data of these datasets are from GitHub Graph2Seq. You can directly use the datasets in our folder dataset/
.
-
WQ:
dataset/
contains files for the WQ dataset. -
PQ:
dataset/
contains files for the PQ dataset.
More specifically, WQ/
and PQ/
mainly contain the following files:
-
train.json
,dev.json
, andtest.json
are the data for train, dev, and test, respectively. -
train_question_gold.txt
,dev_question_gold.txt
, andtest_question_gold.txt
are the ground-truth questions for train data, dev data, and test data, respectively. -
train_skeleton.txt
anddev_skeleton.txt
are skeleton training data constructed using the automatic training data construction strategy we proposed.
-
Prepare the dataset for the skeleton generator by running the following command. Alternatively, You can directly use the built data in
dataset/WQ/train_skeleton.txt
anddataset/WQ/dev_skeleton.txt
(Note: we take the WQ dataset as an example).- Extract skeletons using the rule-based method, execute:
$ python construct_skeleton_data_by_rules.py --fileName './dataset/WQ' --questionName 'train_question_gold.txt' --skeletonName 'train_skeleton_rules.txt'
- Generate skeletons using a ChatGPT-based skeleton generator, execute:
$ python construct_skeleton_data_by_chatgpt.py --fileName './dataset/WQ' --questionName 'train_question_gold.txt' --skeletonName 'train_skeleton_chatgpt.txt'
- Refine skeletons by ChatGPT-based skeleton quality evaluator, execute:
$ python score_skeleton_by_chatgpt.py --fileName './dataset/WQ' --questionName 'train_question_gold.txt' --skeletonName1 'train_skeleton_rules.txt' --skeletonName2 'train_skeleton_chatgpt.txt' --skeletonScore 'train_skeleton_score_by_chatgpt.txt'
- Prepare training data for training skeleton generator, execute:
$ python process_data.py --input_dir './dataset/WQ' --output_dir './output' --model_name_or_path 'facebook/bart-base'
-
To train the skeleton generator, execute:
$ python skeleton_main.py --input_dir './dataset/WQ' --output_dir './output' --model_name_or_path 'facebook/bart-base' --learning_rate 5e-5 --batch_size 16 --num_train_epochs 20
- To infer and acquire the generated skeleton on the test dataset (i.e., './dataset/WQ/predict_test_skeleton.txt'), execute:
$ python skeleton_main.py --isTrain False --input_dir './dataset/WQ' --output_dir 'output' --model_name_or_path 'facebook/bart-base' --batch_size 16
$ python gpt_test_run.py