Language Models as Controlled Natural Language Semantic Parsers for Knowledge Graph Question Answering
This README provides a guide to reproduce our work with different training data sets and pipelines.
The data folder contains two main files:
train.json: the training data for Sparklis, SPARQL, and Squall.
test.json: the test data for the same languages.
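As a quick sanity check, the splits can be loaded with plain Python. The snippet below is a minimal sketch that assumes each file holds a JSON list of records; the actual field names depend on the layout of the files.

```python
import json

# Load the training and test splits (assumes each file is a JSON list of records).
with open("data/train.json", encoding="utf-8") as f:
    train_data = json.load(f)
with open("data/test.json", encoding="utf-8") as f:
    test_data = json.load(f)

print(f"{len(train_data)} training examples, {len(test_data)} test examples")
print(train_data[0])  # inspect one record to see the available fields
```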
We use three separate pipelines for training:
For the Hugging Face pipeline, choose a model name. Available models include T5, Bloom, GPT-2, GPT-Neo, and GPT-XL.
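Below is a minimal inference sketch for the Hugging Face pipeline, using T5 purely as an example; the checkpoint name, prompt prefix, and generation settings are illustrative assumptions, and the actual pipeline is driven by main.py.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative only: "t5-base" and the prompt prefix are assumptions,
# not the project's exact configuration (see main.py for the real pipeline).
model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question = "Who is the author of Le Petit Prince?"
inputs = tokenizer(f"translate question to query: {question}", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Decoder-only models such as Bloom, GPT-2, GPT-Neo, and GPT-XL would be loaded with AutoModelForCausalLM instead of AutoModelForSeq2SeqLM.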
For OpenAI models such as GPT-3 Davinci and GPT-3 Curie, you need to fine-tune them on the training dataset. For instance, we fine-tuned Davinci for 10 epochs and Curie for 20 epochs, with a learning rate of 0.02 and a batch size of 1.
For more detailed instructions, refer to the OpenAI Fine-Tuning Guide.
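For reference, a fine-tuning job with the Davinci settings above could be launched roughly as sketched below. This assumes the legacy openai Python SDK (version < 1.0) and its FineTune endpoint, which OpenAI has since deprecated; the JSONL file name and API key are placeholders, and "learning rate" is interpreted here as the legacy learning_rate_multiplier parameter.

```python
import openai  # legacy SDK (openai < 1.0); the FineTune endpoint is now deprecated

openai.api_key = "YOUR_API_KEY"  # placeholder

# Upload the prompt/completion pairs and start a fine-tune with the Davinci
# settings reported above. "train_davinci.jsonl" is a placeholder file name.
upload = openai.File.create(file=open("train_davinci.jsonl", "rb"), purpose="fine-tune")
job = openai.FineTune.create(
    training_file=upload["id"],
    model="davinci",
    n_epochs=10,
    batch_size=1,
    learning_rate_multiplier=0.02,
)
print(job["id"])  # the resulting fine-tuned model name is what --inference_model_name expects
```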
Note: When running inference with Squall, pass the model output to squall2sparql.sh to obtain an intermediate SPARQL query, and supply the tool's output path via the --output_from_tool_path command-line argument.
For fine-tuning on custom data with Llama 2, first unzip llama2_finetuning.zip and place its contents in the same directory, then follow the instructions in the LLAMA2 Dataset Guide and LLAMA2 Finetuning links above.
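For orientation, a parameter-efficient (LoRA) setup for Llama 2 typically looks like the sketch below; the checkpoint name, target modules, and hyperparameters here are illustrative assumptions, and the authoritative recipe is the one in the unzipped llama2_finetuning material and the linked guides.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative LoRA configuration only; the real settings come from the
# unzipped llama2_finetuning material and the linked guides.
base = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; requires access approval
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```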
Install the necessary packages with the following command:
$ pip install -r requirements.txt
Go to the project directory:
$ cd project_name
Run the following Python command:
$ python ./main.py --pipeline options[openai,huggingface] \
    --modelname options[t5,bloom,gpt-2,gpt-neo,gpt-xl] \
    --path_to_save_prediction (path to save predictions) \
    --language options[sparklis,squall,sparql] \
    --inference_model_name (OpenAI fine-tuned model name) \
    --output_from_tool_path (output path of the squall2sparql tool)
The Squall2Sparql tool [/tools/squall2sparql.sh] is included to convert Squall queries into SPARQL queries that can then be executed on Wikidata. Running the tool is automated by the automate_squall_to_sparql.py script. After the tool generates its output file, the file is further processed with the Squall parser.
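Conceptually, the automation boils down to invoking the shell script on each Squall query and collecting the emitted SPARQL, as in the sketch below; the command-line interface of squall2sparql.sh shown here is an assumption, and automate_squall_to_sparql.py remains the authoritative implementation.

```python
import subprocess

# Illustrative only: how squall2sparql.sh receives its input here is an
# assumption; see automate_squall_to_sparql.py for the real automation.
def squall_to_sparql(squall_query: str) -> str:
    result = subprocess.run(
        ["bash", "./tools/squall2sparql.sh", squall_query],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```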
This project is licensed under the CC-BY-4.0 License.