convert_qa_dataset_to_squad

Forked from https://github.com/jackalhan/qa_datasets_converter and adjusted for my own use.

Primary language: Python. License: MIT.

Note: This is a forked repository that I am editing to fit my own needs. Changes so far:

  1. Full support for TriviaQA (automatically download, read, and write to the SQuAD JSON format)

Dataset Converter for Question-Answering (QA) Tasks

A dataset converter for natural language processing tasks such as question answering (QA): it converts a dataset from one format to another.
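For orientation, the target layout can be sketched as follows. This is a minimal illustration of the publicly documented SQuAD v1.1 nesting (data, paragraphs, qas, answers), not the converter's own code; the helper name make_squad_example and the sample text are made up for the example.

```python
import json


def make_squad_example(title, context, question, answer_text, qid):
    """Wrap one question/answer pair in SQuAD v1.1-style nesting.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    # SQuAD stores the character offset of the answer inside the context.
    answer_start = context.find(answer_text)
    if answer_start < 0:
        raise ValueError("answer text not found in context")
    return {
        "title": title,
        "paragraphs": [
            {
                "context": context,
                "qas": [
                    {
                        "id": qid,
                        "question": question,
                        "answers": [
                            {"text": answer_text, "answer_start": answer_start}
                        ],
                    }
                ],
            }
        ],
    }


article = make_squad_example(
    title="Example",
    context="Paris is the capital of France.",
    question="What is the capital of France?",
    answer_text="Paris",
    qid="example-0",
)
dataset = {"version": "1.1", "data": [article]}
print(json.dumps(dataset, indent=2))
```

Every source format in the table below ends up in this shape, which is why each record needs a context passage and an answer that can be located inside it.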

QA Dataset Paper & Data :

Supported Formats :

Source              Destination  Status
QAngaroo            SQuAD        completed
MCTest              SQuAD        completed
WikiQA              SQuAD        completed
InsuranceQA v1      SQuAD        completed
InsuranceQA v2      SQuAD        completed
TriviaQA            SQuAD        completed
NarrativeQA         SQuAD        completed
MS MARCO            SQuAD        completed
MS MARCO v2         SQuAD        completed
WikiMovies          SQuAD        on hold
Simple Questions    SQuAD        on hold
Ubuntu Corpus v2    SQuAD        completed
NewsQA              SQuAD        completed
SQuAD               MatchZoo     completed
Quasar-T            SQuAD        completed
Quasar-S            SQuAD        completed

Example Call :

A sample call for each format type can be found in the executor.py file. Two of them are shown below.

For TriviaQA (Train)

python executor.py \
--log_path="./log/log.log" \
--data_path="./data/triviaqa/" \
--from_files="source:./datasets/triviaqa-rc/qa/wikipedia-train.json, wikipedia:./datasets/triviaqa-rc/evidence/wikipedia,web:./datasets/triviaqa-rc/evidence/web,seed:10,token_size:2000,sample_size:1000000" \
--from_format="triviaqa" \
--to_format="squad" \
--to_file_name="wikipedia-train-long.json"
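The --from_files value above reads as a comma-separated list of key:value pairs. The sketch below shows one way such a string could be split into a dictionary. It is an assumption about the format based only on the example call; the function name parse_from_files is hypothetical, and the parsing in executor.py may differ.

```python
def parse_from_files(value):
    """Split a 'key:value,key:value' string into a dict of strings.

    Hypothetical helper; it only mirrors the shape of the example argument.
    """
    entries = {}
    for part in value.split(","):
        # Split on the first colon only, and tolerate spaces after commas.
        key, sep, val = part.strip().partition(":")
        if not sep:
            raise ValueError(f"expected key:value, got {part!r}")
        entries[key.strip()] = val.strip()
    return entries


example = (
    "source:./datasets/triviaqa-rc/qa/wikipedia-train.json, "
    "wikipedia:./datasets/triviaqa-rc/evidence/wikipedia,"
    "web:./datasets/triviaqa-rc/evidence/web,"
    "seed:10,token_size:2000,sample_size:1000000"
)
for key, val in parse_from_files(example).items():
    print(f"{key} = {val}")
```

All values stay strings in this sketch; numeric settings such as seed or token_size would still need an explicit int() conversion.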

For TriviaQA (Validation)

python executor.py \
--log_path="./log/log.log" \
--data_path="./data/triviaqa/" \
--from_files="source:./datasets/triviaqa-rc/qa/wikipedia-dev.json, wikipedia:./datasets/triviaqa-rc/evidence/wikipedia,web:./datasets/triviaqa-rc/evidence/web,seed:10,token_size:2000,sample_size:1000000" \
--from_format="triviaqa" \
--to_format="squad" \
--to_file_name="wikipedia-dev-long.json"
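
After a conversion finishes, a quick sanity check is to load the result and count what it contains. The sketch below assumes the output is a SQuAD-style JSON file with the standard data, paragraphs, and qas keys; the function name count_squad_questions is hypothetical, and where the file is written depends on how executor.py combines --data_path and --to_file_name.

```python
import json


def count_squad_questions(path):
    """Return (number of articles, number of questions) in a SQuAD-style file.

    Hypothetical helper; assumes the standard data/paragraphs/qas nesting.
    """
    with open(path, encoding="utf-8") as handle:
        dataset = json.load(handle)
    articles = dataset["data"]
    questions = sum(
        len(paragraph["qas"])
        for article in articles
        for paragraph in article["paragraphs"]
    )
    return len(articles), questions
```

Calling count_squad_questions on the generated wikipedia-dev-long.json, wherever the converter placed it, gives a rough check that the sample_size and token_size settings produced a dataset of the expected size.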