- Full support for TriviaQA (automatically downloads, reads, and writes to the JSON-SQuAD format)
Dataset converter for natural language processing tasks such as question answering (QA): converts datasets from one format to another.
- SQuAD v1 paper | SQuAD v1 data
- SQuAD v2 paper | SQuAD v2 data (NOTE: SQuAD v2 should also be compatible with this code, but this has not been tested)
- QAngaroo paper | QAngaroo data
- MCTest paper | MCTest data
- WikiQA paper | WikiQA data
- InsuranceQA paper | InsuranceQA data v1 | InsuranceQA data v2
- MS MARCO paper | MS MARCO data
- WikiMovies
- TriviaQA paper | TriviaQA data
- Simple Questions
- NarrativeQA paper | NarrativeQA data
- Ubuntu Dialogue Corpus v2.0 paper | Ubuntu Dialogue Corpus v2.0 data
- NewsQA paper | NewsQA data
- Quasar data
- MatchZoo: each line holds a label, the raw query text, and the raw document text. The format is "label \t query \t document_txt".
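
The MatchZoo line layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the converter's own code: the function names `to_matchzoo_line` and `parse_matchzoo_line` are hypothetical, and it assumes that fields are separated by a single tab character and that the label is an integer.

```python
def _single_line(value):
    """Collapse tabs, newlines, and repeated spaces so a field cannot break the columns."""
    return " ".join(str(value).split())


def to_matchzoo_line(label, query, document_txt):
    """Join one example as 'label<TAB>query<TAB>document_txt' (assumed separator: one tab)."""
    return "\t".join(
        [_single_line(label), _single_line(query), _single_line(document_txt)]
    )


def parse_matchzoo_line(line):
    """Split a line back into (label, query, document_txt); the label is assumed to be an int."""
    label, query, document_txt = line.rstrip("\n").split("\t", 2)
    return int(label), query, document_txt


line = to_matchzoo_line(
    1,
    "who wrote hamlet",
    "Hamlet is a tragedy\twritten by William Shakespeare.",
)
print(parse_matchzoo_line(line))
```

Whitespace inside each field is normalized before joining, so a stray tab in the document text does not create a fourth column when the line is read back.
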
Source | Destination | Status |
---|---|---|
QAngaroo | SQuAD | completed |
MCTest | SQuAD | completed |
WikiQA | SQuAD | completed |
InsuranceQA v1 | SQuAD | completed |
InsuranceQA v2 | SQuAD | completed |
TriviaQA | SQuAD | completed |
NarrativeQA | SQuAD | completed |
MS MARCO | SQuAD | completed |
MS MARCO v2 | SQuAD | completed |
WikiMovies | SQuAD | on hold |
Simple Questions | SQuAD | on hold |
Ubuntu Corpus v2 | SQuAD | completed |
NewsQA | SQuAD | completed |
SQuAD | MatchZoo | completed |
Quasar-T | SQuAD | completed |
Quasar-S | SQuAD | completed |
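
Since most conversions in the table target SQuAD, it may help to see what a minimal SQuAD v1.1-style file looks like. The sketch below builds one by hand; the helper name `make_squad_document` and the sample text are made up for illustration, and the converter's actual output may carry additional fields.

```python
import json


def make_squad_document(title, context, question, answer_text, qa_id):
    """Build a one-paragraph, one-question document in the SQuAD v1.1 layout."""
    # SQuAD stores the character offset of the answer inside the context.
    answer_start = context.find(answer_text)
    if answer_start < 0:
        raise ValueError("answer text does not occur in the context")
    return {
        "title": title,
        "paragraphs": [
            {
                "context": context,
                "qas": [
                    {
                        "id": qa_id,
                        "question": question,
                        "answers": [
                            {"text": answer_text, "answer_start": answer_start}
                        ],
                    }
                ],
            }
        ],
    }


document = make_squad_document(
    title="Hamlet",
    context="Hamlet is a tragedy written by William Shakespeare.",
    question="Who wrote Hamlet?",
    answer_text="William Shakespeare",
    qa_id="example-0001",
)
dataset = {"version": "1.1", "data": [document]}
print(json.dumps(dataset, indent=2))
```
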
Sample calls for each format type are in the executor.py file. For example, the following two commands convert the TriviaQA Wikipedia train and dev sets to the SQuAD format:

    python executor.py \
    --log_path="./log/log.log" \
    --data_path="./data/triviaqa/" \
    --from_files="source:./datasets/triviaqa-rc/qa/wikipedia-train.json, wikipedia:./datasets/triviaqa-rc/evidence/wikipedia,web:./datasets/triviaqa-rc/evidence/web,seed:10,token_size:2000,sample_size:1000000" \
    --from_format="triviaqa" \
    --to_format="squad" \
    --to_file_name="wikipedia-train-long.json"

    python executor.py \
    --log_path="./log/log.log" \
    --data_path="./data/triviaqa/" \
    --from_files="source:./datasets/triviaqa-rc/qa/wikipedia-dev.json, wikipedia:./datasets/triviaqa-rc/evidence/wikipedia,web:./datasets/triviaqa-rc/evidence/web,seed:10,token_size:2000,sample_size:1000000" \
    --from_format="triviaqa" \
    --to_format="squad" \
    --to_file_name="wikipedia-dev-long.json"
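
The `--from_files` value in these calls is a comma-separated list of `key:value` pairs. How executor.py reads it is not shown here, so the following is only a sketch of one way such a string could be split into a dictionary; the function name `parse_from_files` is hypothetical.

```python
def parse_from_files(value):
    """Split 'key:value,key:value' into a dict (illustrative, not executor.py's parser)."""
    parsed = {}
    for item in value.split(","):
        item = item.strip()  # tolerate the space that follows a comma in the sample calls
        if not item:
            continue
        # Split on the first colon only, so the value may itself contain colons.
        key, _, val = item.partition(":")
        parsed[key.strip()] = val.strip()
    return parsed


options = parse_from_files(
    "source:./datasets/triviaqa-rc/qa/wikipedia-dev.json, "
    "wikipedia:./datasets/triviaqa-rc/evidence/wikipedia,"
    "web:./datasets/triviaqa-rc/evidence/web,"
    "seed:10,token_size:2000,sample_size:1000000"
)
seed = int(options["seed"])
token_size = int(options["token_size"])
print(options["source"], seed, token_size)
```

All values come back as strings, so numeric options such as `seed` and `token_size` are converted explicitly after parsing.
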