noisy-squad

A noisy version of the SQuAD dataset

Distributed under the CC BY-SA 4.0 license

This dataset was built by adding noise to the SQuAD 1.1 dataset ("SQuAD: 100,000+ Questions for Machine Comprehension of Text" by Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev and Percy Liang).

The dataset contains five types of noise, inspired by previous work by Belinkov and Bisk (2018):

Natural Noise: Words are replaced with real typing errors made by people. To automate this, a collection of word corrections made by people on web platforms that track edit history was used.
Swap Noise: For each word in the text, one random pair of consecutive characters is swapped (e.g. expression → exrpession).
Middle Random Noise: For each word in the text, all characters except the first and last are shuffled (e.g. expression → esroxiespn).
Fully Random Noise: For each word in the text, all characters are shuffled (e.g. expression → rsnixpoees).
Keyboard Typo Noise: For each word in the text, one character is replaced by an adjacent character on a traditional English keyboard (e.g. expression → exprwssion).
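
The four synthetic noise types above can be sketched in a few lines of Python. This is only an illustration of the descriptions, not the code used to build the dataset: the function names, the whitespace tokenization, the handling of short words, and the `KEYBOARD_NEIGHBORS` map (which only considers left and right neighbors within a QWERTY row) are all assumptions. Natural Noise is left out because it depends on an external collection of human corrections.

```python
import random

# Assumption: a simplified QWERTY adjacency map that only links each key
# to its left and right neighbors in the same row.
_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
KEYBOARD_NEIGHBORS = {}
for row in _ROWS:
    for i, ch in enumerate(row):
        KEYBOARD_NEIGHBORS[ch] = [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]


def swap_noise(word, rng):
    """Swap one random pair of consecutive characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def middle_random_noise(word, rng):
    """Shuffle every character except the first and the last."""
    if len(word) < 4:
        return word  # fewer than two middle characters, nothing to shuffle
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def fully_random_noise(word, rng):
    """Shuffle all characters of the word."""
    chars = list(word)
    rng.shuffle(chars)
    return "".join(chars)


def keyboard_typo_noise(word, rng):
    """Replace one character with a neighboring key (result is lowercase)."""
    positions = [i for i, c in enumerate(word) if c.lower() in KEYBOARD_NEIGHBORS]
    if not positions:
        return word
    i = rng.choice(positions)
    replacement = rng.choice(KEYBOARD_NEIGHBORS[word[i].lower()])
    return word[:i] + replacement + word[i + 1:]


def add_noise(text, noise_fn, seed=0):
    """Apply noise_fn to every whitespace-separated word of text."""
    rng = random.Random(seed)
    return " ".join(noise_fn(word, rng) for word in text.split())


if __name__ == "__main__":
    sentence = "machine comprehension of text"
    for fn in (swap_noise, middle_random_noise, fully_random_noise, keyboard_typo_noise):
        print(fn.__name__, "->", add_noise(sentence, fn, seed=42))
```

Passing a seeded `random.Random` instance keeps the output reproducible, which helps when comparing models on the same noisy text. Punctuation attached to a word is treated as part of the word here; a real pipeline would likely need to handle it separately so that answer spans still align with the context.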

caspillaga/noisy-squad