noisy-squad

A noisy version of the SQuAD dataset

Distributed under the CC BY-SA 4.0 license

This dataset was built by adding noise to the SQuAD 1.1 dataset ("SQuAD: 100,000+ Questions for Machine Comprehension of Text" by Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang).

The dataset contains five types of noise, inspired by previous work by Belinkov and Bisk (2018):

  • Natural Noise: Words are replaced with real typing errors made by people. To automate this, the errors were taken from a collection of word corrections made by users on web platforms that keep an edit history.
  • Swap Noise: For each word in the text, one random pair of consecutive characters is swapped (e.g. expression → exrpession).
  • Middle Random Noise: For each word in the text, all characters except the first and last are shuffled (e.g. expression → esroxiespn).
  • Fully Random Noise: For each word in the text, all characters are shuffled (e.g. expression → rsnixpoees).
  • Keyboard Typo Noise: For each word in the text, one character is replaced by a character adjacent to it on a traditional English keyboard (e.g. expression → exprwssion).