Dutch variants of (English) Natural Language Processing datasets.
⚠️ Warning: Work in progress.
Recreation of Words in Context (WiC) based on DutchSemCor.
WIP
Recreation of The Winograd Schema Challenge (WSC) based on SemEval-2010 Task 1.
WIP
Translation of Choice of Plausible Alternatives (COPA).
Split | Source | Procedure | English | Dutch |
---|---|---|---|---|
train | COPA-dev (first 400)¹ | Google Translate + Human | 400 | 400 |
dev | COPA-dev (last 100)¹ | Google Translate + Human | 100 | 100 |
test | COPA-test | Google Translate + Human | 500 | 500 |

¹ These splits are the same as in SuperGLUE.
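
The train/dev split in the table above can be sketched as follows. This is a minimal illustration, assuming the 500 COPA-dev items are already loaded in their original order. The function name `split_copa_dev` and the placeholder items are hypothetical and not part of this repository.

```python
def split_copa_dev(dev_items, n_train=400):
    """Split the ordered COPA-dev items into (train, dev).

    The first `n_train` items become the train split and the
    remaining items become the dev split.
    """
    train = dev_items[:n_train]
    dev = dev_items[n_train:]
    return train, dev


# Placeholder stand-ins for the 500 COPA-dev items (hypothetical data).
items = [{"id": i} for i in range(1, 501)]
train, dev = split_copa_dev(items)
print(len(train), len(dev))
```

Because the split is positional, it only reproduces the table if the items keep the order of the original COPA-dev file.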
Translation of The Stanford Question Answering Dataset (SQuAD).
Split | Source | Procedure | English | Dutch |
---|---|---|---|---|
train | SQuAD-train-v1.1 | Google Translate | 87,599 | 87,599 |
dev | SQuAD-dev-v1.1 \ XQuAD | Google Translate | 9,380 | 9,380 |
test | SQuAD-dev-v1.1 & XQuAD | Google Translate + Human | 1,190 | 1,183 |
Split | Source | Procedure | English | Dutch |
---|---|---|---|---|
train | SQuAD-train-v2.0 | Google Translate | 130,319 | 130,319 |
dev | SQuAD-dev-v2.0 \ XQuAD | Google Translate | 10,174 | 10,174 |
test | SQuAD-dev-v2.0 & XQuAD | Google Translate + Human | 1,699 | 1,699 |
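
In the Source column of the SQuAD tables, `\` can be read as set difference and `&` as intersection: the test split holds the SQuAD-dev examples that also occur in XQuAD, and the dev split holds the rest. The sketch below illustrates such a partition on toy data. The function `partition_by_overlap`, the `id` field used for matching, and the example records are assumptions for illustration; the README does not state which key was used to match examples against XQuAD.

```python
def partition_by_overlap(dev_examples, xquad_keys, key=lambda ex: ex["id"]):
    """Partition SQuAD-dev examples into (dev_split, test_split).

    dev_split  = examples whose key is NOT in xquad_keys  (SQuAD-dev \\ XQuAD)
    test_split = examples whose key IS in xquad_keys      (SQuAD-dev & XQuAD)
    """
    dev_split = [ex for ex in dev_examples if key(ex) not in xquad_keys]
    test_split = [ex for ex in dev_examples if key(ex) in xquad_keys]
    return dev_split, test_split


# Toy data (hypothetical): ten examples, three of which overlap with XQuAD.
examples = [{"id": f"q{i}"} for i in range(10)]
xquad = {"q2", "q5", "q7"}
dev_split, test_split = partition_by_overlap(examples, xquad)

# The two splits are disjoint and together cover the original dev set.
assert len(dev_split) + len(test_split) == len(examples)
print(len(dev_split), len(test_split))
```

The `key` parameter is there so that a different matching key (for example, the paragraph an example belongs to) can be substituted without changing the partition logic. As a sanity check on the tables, the English dev and test counts sum to 10,570 for v1.1 and 11,873 for v2.0.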
Translation of Sentences Involving Compositional Knowledge (SICK).
Split | Source | Procedure | English | Dutch |
---|---|---|---|---|
train | SICK-train | DeepL + Human | 4,439 | 4,439 |
dev | SICK-trial | DeepL + Human | 495 | 495 |
test | SICK-test | DeepL + Human | 4,906 | 4,906 |