A pipeline to automatically generate questions and answers pertaining to Icelandic culture and/or history from a dataset.
To run the script you will need a Python 3 environment, version 3.10.13 or older. Install the required dependencies by running:
pip install -r requirements.txt
The input dataset is assumed to be a jsonl file with the following keys for each document: "url", "title" and "text", where "text" is the text to be used for creating questions and answers. An example of such a dataset is the Wikipedia dataset.
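For illustration, a single document in the input jsonl file might look like the line below. The values are made up; only the three keys are prescribed by the pipeline:

```json
{"url": "https://is.wikipedia.org/wiki/Reykjavík", "title": "Reykjavík", "text": "Reykjavík er höfuðborg Íslands ..."}
```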
The pipeline consists of the following steps (the first step is sketched in code after the list):
- Convert the documents to requests for a GPT model.
- Make API calls to a GPT model to generate questions and answers.
- Filter the generated questions and answers based on scores given by GPT.
- Correct spelling and grammar in the questions and answers.
- Add metadata (such as the source URL, title and context) to the questions and answers.
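A minimal sketch of the first step, converting documents into GPT requests, assuming the OpenAI chat-completions request shape. The function name and prompt wording are illustrative, not the actual code in run_qa_pipe.py:

```python
import json

def build_requests(dataset_path: str, gpt_model: str = "gpt-4-turbo") -> list[dict]:
    """Turn each input document into a chat-completion request payload."""
    requests = []
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            requests.append({
                "model": gpt_model,
                "messages": [
                    # Illustrative system prompt; the pipeline's real prompt
                    # is not shown here.
                    {"role": "system",
                     "content": "Generate question-answer pairs about "
                                "Icelandic culture or history from the text."},
                    {"role": "user", "content": doc["text"]},
                ],
            })
    return requests
```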
The pipeline can be run from the command line with the following command:
python run_qa_pipe.py --dataset-path path/to/dataset
The default GPT model is 'gpt-4-turbo', but this can be changed by passing the relevant model's name with the --gpt-model flag:
python run_qa_pipe.py --dataset-path path/to/dataset --gpt-model name-of-gpt-model
In order to make API calls to GPT, you need an OpenAI API key. Instructions on how to obtain one are in the OpenAI API reference.
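The official openai Python client reads the key from the OPENAI_API_KEY environment variable by default; whether run_qa_pipe.py follows that convention is an assumption. A quick sanity check before running the pipeline:

```python
import os

# The openai client looks for OPENAI_API_KEY by default; that this script
# does the same is an assumption.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("Set the OPENAI_API_KEY environment variable first.")
```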
The pipeline corrects spelling and grammar in the questions and answers by default, but this step can be skipped by including the --skip-correction flag:
python run_qa_pipe.py --dataset-path path/to/dataset --skip-correction
In order to correct spelling and grammar, the Byte-Level Neural Error Correction Model for Icelandic must be downloaded and placed within this repository.
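As a rough illustration only, a downloaded seq2seq correction model could be applied with the Hugging Face transformers library. The model path below is a placeholder for wherever the downloaded model directory is placed in the repository, and this is not the pipeline's actual correction code:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder path: point this at the downloaded model directory.
tokenizer = AutoTokenizer.from_pretrained("byte-level-correction-model")
model = AutoModelForSeq2SeqLM.from_pretrained("byte-level-correction-model")

text = "Hvenær var Alþingi stofnað?"  # a generated question to correct
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```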
The pipeline writes the final questions and answers to the file data/queries.jsonl. The output is a jsonl file where each question-answer pair has the following keys: "question", "answer", "question_id", "question_score", "document_score", "url", "title" and "context". Outputs from each step in the pipeline are written to data/extra-data/.
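An output record might look like the line below. The values and field types are made up for illustration; only the key names come from the pipeline's output format:

```json
{"question": "Hvenær var Alþingi stofnað?", "answer": "Alþingi var stofnað árið 930.", "question_id": 0, "question_score": 5, "document_score": 4, "url": "https://is.wikipedia.org/wiki/Alþingi", "title": "Alþingi", "context": "..."}
```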