MMQA Dataset

Multi-domain English-Hindi Question Answering Dataset.

The dataset can be downloaded from here.

Details

This multilingual QA dataset is built from comparable documents in six domains (Tourism, History, Geography, Environment, Diseases and Economics). The resources are divided into three sub-resources:

  • Multi-domain Multi-lingual Question-Answer (MMQA): This dataset contains 5,495 question-answer pairs in English and Hindi (see our paper for details). The file “QA_Pairs.tsv” contains the dataset in tab-separated format.
  • Question classification dataset: This dataset comprises 1,022 English questions, each labelled with a coarse and a fine class. The file “Question_Classification_Data.tsv” contains the dataset in tab-separated format.
  • Comparable Corpora: This dataset contains 500 comparable documents in English and Hindi. The folder “Comparable Corpora” contains the dataset.
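
As a rough illustration of how the two tab-separated files might be read, here is a minimal Python sketch. It assumes the files are UTF-8 encoded and sit in the current directory. The column layout is not documented here, so the helper `read_tsv` (a hypothetical name, not part of the dataset) returns raw rows without interpreting them. Inspect the first few rows to confirm the column order and whether a header line is present.

```python
import csv
from pathlib import Path


def read_tsv(path, encoding="utf-8"):
    """Read a tab-separated file into a list of rows (each row is a list of strings)."""
    with open(path, newline="", encoding=encoding) as handle:
        # Assumption: quote characters are ordinary text, not field delimiters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        # Skip completely empty lines.
        return [row for row in reader if row]


if __name__ == "__main__":
    # File names are taken from the list above; their location is an assumption.
    for name in ("QA_Pairs.tsv", "Question_Classification_Data.tsv"):
        path = Path(name)
        if path.exists():
            rows = read_tsv(path)
            print(name, "->", len(rows), "rows; first row:", rows[0] if rows else None)
        else:
            print(name, "not found in the current directory")
```

If the files turn out to use a different encoding, pass it through the `encoding` argument.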

Reference

If you use this resource, please cite our paper:

Gupta, Deepak, et al. "MMQA: A Multi-domain Multi-lingual Question-Answering Framework for English and Hindi." Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018). 2018.

@InProceedings{GUPTA18.826,
author = {Deepak Gupta and Surabhi Kumari and Asif Ekbal and Pushpak Bhattacharyya},
title = "{MMQA: A Multi-domain Multi-lingual Question-Answering Framework for English and Hindi}",
booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation
(LREC 2018)},
year = {2018},
month = {May 7-12, 2018},
address = {Miyazaki, Japan},
editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck
and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo
and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
publisher = {European Language Resources Association (ELRA)},
isbn = {979-10-95546-00-9},
language = {english}
}

License

The MMQA dataset is distributed under the CC BY-NC-SA license.