Dataset converter for natural language processing tasks such as question answering (QA): converts datasets from one format to another
- SQuAD v1 paper | SQuAD v1 data
- SQuAD v2 paper | SQuAD v2 data (NOTE: SQuAD v2 should also be compatible with this code, but this is NOT TESTED)
- QAngaroo paper | QAngaroo data
- MCTest paper | MCTest data
- WikiQA paper | WikiQA data
- InsuranceQA paper | InsuranceQA data v1 - InsuranceQA data v2
- MS MARCO paper | MS MARCO data
- WikiMovies
- TriviaQA paper | TriviaQA data
- Simple Questions
- NarrativeQA paper | NarrativeQA data
- Ubuntu Dialogue Corpus v2.0 paper | Ubuntu Dialogue Corpus v2.0 data
- NewsQA paper | NewsQA data
- Quasar data
- MatchZoo: each line holds a label, the raw query, and the raw text of one document, in the format "label \t query \t document_txt".
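
As a rough illustration of the MatchZoo line format listed above, the following sketch writes and reads one tab-separated line. The helper names `to_matchzoo_line` and `parse_matchzoo_line` are hypothetical and are not part of this repository. Collapsing embedded tabs and newlines into single spaces is an assumption, made so that each record stays on one line.

```python
def to_matchzoo_line(label, query, document_txt):
    """Build one "label \\t query \\t document_txt" line (hypothetical helper)."""
    def clean(text):
        # Assumption: collapse all whitespace runs (including tabs and
        # newlines) into single spaces so a field cannot break the layout.
        return " ".join(str(text).split())

    return "\t".join([str(label), clean(query), clean(document_txt)])


def parse_matchzoo_line(line):
    """Split one line back into (label, query, document_txt) strings."""
    label, query, document_txt = line.rstrip("\n").split("\t", 2)
    return label, query, document_txt


line = to_matchzoo_line(1, "who wrote hamlet", "Hamlet was written by\tWilliam Shakespeare.")
print(line)
print(parse_matchzoo_line(line))
```

The label is kept as a string here because the accepted label values are not specified in this document.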
Source | Destination | Status |
---|---|---|
QAngaroo | SQuAD | completed |
MCTest | SQuAD | completed |
WikiQA | SQuAD | completed |
InsuranceQA v1 | SQuAD | completed |
InsuranceQA v2 | SQuAD | completed |
TriviaQA | SQuAD | completed |
NarrativeQA | SQuAD | completed |
MS MARCO | SQuAD | completed |
MS MARCO v2 | SQuAD | completed |
WikiMovies | SQuAD | on hold |
Simple Questions | SQuAD | on hold |
Ubuntu Corpus v2 | SQuAD | completed |
NewsQA | SQuAD | completed |
SQuAD | MatchZoo | completed |
Quasar-T | SQuAD | completed |
Quasar-S | SQuAD | completed |
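
Most conversions in the table target SQuAD. The sketch below builds a minimal SQuAD v1.1-style dictionary from a single passage, question, and answer. The function name `to_squad` and the sample values are hypothetical and the code is not taken from this repository. The key layout (`data`, `paragraphs`, `context`, `qas`, `answers`, `answer_start`) follows the public SQuAD v1.1 JSON files.

```python
import json


def to_squad(title, context, question, answer_text, qid, version="1.1"):
    """Wrap one (context, question, answer) triple in a SQuAD-style dict."""
    # answer_start is the character offset of the answer inside the context.
    answer_start = context.find(answer_text)
    if answer_start < 0:
        raise ValueError("answer text not found in context")
    return {
        "version": version,
        "data": [{
            "title": title,
            "paragraphs": [{
                "context": context,
                "qas": [{
                    "id": qid,
                    "question": question,
                    "answers": [{"text": answer_text, "answer_start": answer_start}],
                }],
            }],
        }],
    }


example = to_squad(
    title="Hamlet",
    context="Hamlet was written by William Shakespeare.",
    question="Who wrote Hamlet?",
    answer_text="William Shakespeare",
    qid="q1",
)
print(json.dumps(example, indent=2))
```

Source datasets whose answers are free-form text rather than spans of the passage would need a different strategy than the plain `str.find` lookup used here.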
You can find a sample call for each format type in the executor.py file, such as the one below.
python executor.py \
  --log_path="~/log.log" \
  --data_path="~/data/" \
  --from_files="source:question.train.token_idx.label,voc:vocabulary,answer:answers.label.token_idx" \
  --from_format="insuranceqa" \
  --to_format="squad" \
  --to_file_name="filename.what"  # the output file will be renamed to "[from_to]_filename.what"
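
The `--from_files` value above is a comma-separated list of `key:file` pairs. A minimal sketch of how such a value could be split into a dictionary and resolved against `--data_path` follows. `parse_from_files` is a hypothetical helper and may not match how executor.py actually handles the argument.

```python
import os


def parse_from_files(from_files, data_path):
    """Map each key in "key:file,key:file" to a path under data_path."""
    files = {}
    for pair in from_files.split(","):
        # partition splits on the first ":" only, so file names may contain ":".
        key, _, name = pair.strip().partition(":")
        if not key or not name:
            raise ValueError(f"expected 'key:file', got {pair!r}")
        # expanduser turns a leading "~" into the user's home directory.
        files[key] = os.path.join(os.path.expanduser(data_path), name)
    return files


files = parse_from_files(
    "source:question.train.token_idx.label,voc:vocabulary,answer:answers.label.token_idx",
    "~/data/",
)
for key, path in files.items():
    print(key, "->", path)
```

Expanding `~` in code is useful here because a shell does not expand a tilde that appears inside double quotes.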