annotations_creators: expert-generated
language_creators: found
languages:
licenses: cc-by-nc-sa-3.0
multilinguality: monolingual
size_categories: 1K<n<10K
source_datasets: extended|other-thaiqa
task_categories: question-answering
task_ids: extractive-qa, open-domain-qa

Dataset Card for `thaiqa-squad`

Dataset Description
Dataset Structure
Dataset Creation
Considerations for Using the Data
Additional Information

Dataset Description

Homepage: http://github.com/pythainlp/thaiqa_squad (original thaiqa at https://aiforthai.in.th/)
Repository: http://github.com/pythainlp/thaiqa_squad
Paper:
Leaderboard:
Point of Contact: http://github.com/pythainlp/ (original thaiqa at https://aiforthai.in.th/)

Dataset Summary

thaiqa_squad is an open-domain, extractive question answering dataset in SQuAD format, with 4,000 questions in the train split and 74 questions in the validation split. NECTEC originally created the data from Wikipedia articles, and PyThaiNLP adapted it to SQuAD format.
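Because the dataset is extractive, each answer should be a literal span of its context. The sketch below shows one way to check that property on a record laid out in the generic SQuAD style. The field names (`id`, `context`, `question`, `answers`, `text`, `answer_start`) follow the common SQuAD convention and are an assumption here, since this card does not list the actual fields; the record itself is a made-up illustration, not a row from thaiqa_squad.

```python
# Hypothetical record in the generic SQuAD layout. Field names and text are
# illustrative assumptions, not copied from thaiqa_squad.
record = {
    "id": "example-0",
    "context": "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย",
    "question": "เมืองหลวงของประเทศไทยคืออะไร",
    "answers": {"text": ["กรุงเทพมหานคร"], "answer_start": [0]},
}


def answer_spans_match(rec: dict) -> bool:
    """Return True if every answer text occurs in the context at its stated offset."""
    context = rec["context"]
    texts = rec["answers"]["text"]
    starts = rec["answers"]["answer_start"]
    # Offsets are counted in Python string indices (Unicode code points).
    return len(texts) == len(starts) and all(
        context[start:start + len(text)] == text
        for text, start in zip(texts, starts)
    )
```

Running a check like this over a split is a cheap way to catch offsets that drifted during preprocessing.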

Supported Tasks and Leaderboards

extractive question answering

Languages

Thai

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

Statistic	train	valid
Number of questions	4000	74
Average words in context	1186.740750	1016.459459
Average words in question	14.325500	12.743243
Average words in answer	3.279750	4.608108
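The averages above depend on how text is split into words, and this card does not say which tokenizer was used. The following is a minimal sketch of how such per-split averages could be computed, with the tokenizer passed in as a parameter. The function name `average_token_count` is hypothetical. The default whitespace split is only a placeholder: Thai is written without spaces between words, so a Thai word tokenizer (for example PyThaiNLP's `word_tokenize`) would be needed to get figures comparable to the table.

```python
from statistics import mean
from typing import Callable, Iterable, List


def average_token_count(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]] = str.split,
) -> float:
    """Average number of tokens per text, using the supplied tokenizer.

    The default (whitespace split) is a placeholder and is not suitable
    for Thai; pass a Thai word tokenizer instead.
    """
    # statistics.mean raises StatisticsError if `texts` is empty.
    return mean(len(tokenize(text)) for text in texts)
```

For instance, `average_token_count(contexts, tokenize=word_tokenize)` would apply a tokenizer imported from `pythainlp.tokenize` to a list of context strings.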

Dataset Creation

Curation Rationale

PyThaiNLP created thaiqa_squad as a SQuAD-format version of thaiqa. thaiqa comes from "The 2nd Question Answering Program from Thai Wikipedia" of the National Software Contest 2020.

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

Wikipedia authors wrote the contexts; NECTEC produced the questions and answer annotations.

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

NECTEC

Personal and Sensitive Information

All content is from Wikipedia. No personal or sensitive information is expected to be included.

Considerations for Using the Data

Social Impact of Dataset

The dataset supports open-domain, extractive question answering in Thai.

Discussion of Biases

[More Information Needed]

Other Known Limitations

The contexts include <doc> tags at the start and at the end.
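If the tags are unwanted, they can be stripped before training, but any character offset into the context then has to be shifted by the length of the removed opening tag. The sketch below is one possible way to do this. It assumes the opening tag looks like `<doc>` or `<doc ...>` with optional attributes and that the closing tag is `</doc>`; the exact tag shape is not specified in this card, and the helper name `strip_doc_tags` is hypothetical.

```python
import re
from typing import Tuple

# Assumed tag shapes: "<doc>" or "<doc attr=...>" at the start, "</doc>" at the end.
_OPEN_DOC = re.compile(r"^\s*<doc\b[^>]*>")
_CLOSE_DOC = re.compile(r"</doc>\s*$")


def strip_doc_tags(context: str, answer_start: int) -> Tuple[str, int]:
    """Remove a leading <doc ...> tag and a trailing </doc> tag.

    Returns the cleaned context and the answer offset shifted left by the
    number of characters removed from the front.
    """
    opening = _OPEN_DOC.match(context)
    removed = opening.end() if opening else 0
    cleaned = _CLOSE_DOC.sub("", context[removed:])
    return cleaned, answer_start - removed
```

A negative returned offset would mean the original answer position pointed inside the opening tag, which is worth flagging rather than silently correcting.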

Additional Information

Dataset Curators

NECTEC for the original thaiqa. SQuAD formatting by PyThaiNLP.

Licensing Information

CC-BY-NC-SA 3.0

Citation Information

[More Information Needed]
