
The PyTorch implementation of SGSH (NAACL 2024 Findings)


SGSH

The PyTorch implementation of SGSH: Stimulate Large Language Models with Skeleton Heuristics for Knowledge Base Question Generation (NAACL 2024 Findings).

Requirements

1. Environments

  • Create a virtual environment by running the following command:
$ conda env create --name=SGSH --file=environment.yml
  • Activate the environment using:
$ conda activate SGSH

2. Dataset

Our experiments use two widely used datasets: WebQuestions (WQ) and PathQuestions (PQ). The raw data for both datasets come from the Graph2Seq repository on GitHub. You can directly use the datasets in our folder dataset/.

  • WQ: dataset/WQ/ contains files for the WQ dataset.

  • PQ: dataset/PQ/ contains files for the PQ dataset.

More specifically, WQ/ and PQ/ mainly contain the following files:

  • train.json, dev.json, and test.json are the training, development, and test data, respectively.

  • train_question_gold.txt, dev_question_gold.txt, and test_question_gold.txt are the ground-truth questions for train data, dev data, and test data, respectively.

  • train_skeleton.txt and dev_skeleton.txt are the skeleton training data, built with our proposed automatic training data construction strategy.
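Before fine-tuning, it can help to confirm that the gold questions and the skeletons line up. The following is a minimal sketch, not part of the repository: the helper names `read_lines` and `pair_questions_with_skeletons` are hypothetical, and it assumes that both text files store one entry per line in the same order, which the file descriptions above suggest but do not state.

```python
from pathlib import Path


def read_lines(path):
    """Return the lines of a UTF-8 text file without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def pair_questions_with_skeletons(dataset_dir, split="train"):
    """Pair each gold question with the skeleton on the same line.

    Assumption (not confirmed by the repository): both files hold one
    entry per line, and line i of one file matches line i of the other.
    """
    dataset_dir = Path(dataset_dir)
    questions = read_lines(dataset_dir / f"{split}_question_gold.txt")
    skeletons = read_lines(dataset_dir / f"{split}_skeleton.txt")
    if len(questions) != len(skeletons):
        raise ValueError(
            f"{split}: {len(questions)} questions but {len(skeletons)} skeletons"
        )
    return list(zip(questions, skeletons))
```

For example, `pair_questions_with_skeletons("./dataset/WQ", "dev")` would return a list of (question, skeleton) tuples, or raise `ValueError` if the two files differ in length.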

Quick Start for Running

1. Fine-tuning Skeleton Generator.

  • Prepare the dataset for the skeleton generator by running the following commands. Alternatively, you can directly use the prebuilt data in dataset/WQ/train_skeleton.txt and dataset/WQ/dev_skeleton.txt (Note: we take the WQ dataset as an example).

    • To extract skeletons using the rule-based method, execute:
    $ python construct_skeleton_data_by_rules.py --fileName './dataset/WQ' --questionName 'train_question_gold.txt' --skeletonName 'train_skeleton_rules.txt'
    
    • To generate skeletons using the ChatGPT-based skeleton generator, execute:
    $ python construct_skeleton_data_by_chatgpt.py --fileName './dataset/WQ' --questionName 'train_question_gold.txt' --skeletonName 'train_skeleton_chatgpt.txt'
    
    • To refine skeletons using the ChatGPT-based skeleton quality evaluator, execute:
    $ python score_skeleton_by_chatgpt.py --fileName './dataset/WQ' --questionName 'train_question_gold.txt' --skeletonName1 'train_skeleton_rules.txt' --skeletonName2 'train_skeleton_chatgpt.txt' --skeletonScore 'train_skeleton_score_by_chatgpt.txt'
    
    • To prepare the training data for the skeleton generator, execute:
    $ python process_data.py --input_dir './dataset/WQ' --output_dir './output' --model_name_or_path 'facebook/bart-base'
    
  • To train the skeleton generator, execute:

$ python skeleton_main.py --input_dir './dataset/WQ' --output_dir './output' --model_name_or_path 'facebook/bart-base' --learning_rate 5e-5 --batch_size 16 --num_train_epochs 20
  • To run inference and obtain the generated skeletons for the test set (i.e., './dataset/WQ/predict_test_skeleton.txt'), execute:
$ python skeleton_main.py --isTrain False --input_dir './dataset/WQ' --output_dir 'output' --model_name_or_path 'facebook/bart-base' --batch_size 16 

2. To run inference with GPT-3.5 (e.g., text-davinci-003) and obtain the generated questions, execute:

$ python gpt_test_run.py
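
The Quick Start commands above can also be chained from a single script. The sketch below is an illustration only and is not part of the repository: `build_quick_start_commands` and `run_quick_start` are hypothetical helpers, the script names and flags are copied from the commands above, and it assumes the prebuilt skeleton files in dataset/WQ/ are used, so the three skeleton construction scripts are skipped. It defaults to a dry run that only prints each command.

```python
import subprocess
import sys


def build_quick_start_commands(dataset_dir="./dataset/WQ",
                               output_dir="./output",
                               model="facebook/bart-base"):
    """Return the Quick Start commands as argument lists, in run order.

    Flags mirror the README commands; skipping the skeleton construction
    scripts (prebuilt skeleton files are used) is an assumption.
    """
    py = sys.executable  # run the scripts with the current interpreter
    shared = ["--input_dir", dataset_dir,
              "--output_dir", output_dir,
              "--model_name_or_path", model]
    return [
        # Prepare the training data for the skeleton generator.
        [py, "process_data.py"] + shared,
        # Fine-tune the skeleton generator.
        [py, "skeleton_main.py"] + shared
        + ["--learning_rate", "5e-5", "--batch_size", "16",
           "--num_train_epochs", "20"],
        # Generate skeletons for the test set.
        [py, "skeleton_main.py", "--isTrain", "False"] + shared
        + ["--batch_size", "16"],
        # Query GPT-3.5 to obtain the generated questions.
        [py, "gpt_test_run.py"],
    ]


def run_quick_start(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    check=True stops the chain at the first failing step.
    """
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run_quick_start(build_quick_start_commands(), dry_run=False)` from the repository root, inside the activated SGSH environment, would run the four steps in order.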