Do We Know What We Don't Know?

Studying Unanswerable Questions beyond SQuAD 2.0

Repository for the paper:

      Do We Know What We Don't Know? Studying Unanswerable Questions beyond SQuAD 2.0
      Elior Sulem, Jamaal Hay and Dan Roth
      Findings of EMNLP 2021

1. Datasets

Existing Datasets Used in the paper:

SQuAD 2.0
- Train Set
- Dev Set
MNLI
- Train Set
- Dev Set

Script for downloading GLUE_DATA: https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e

Link for MNLI (Matched) data alone: https://dl.fbaipublicfiles.com/glue/data/MNLI.zip

New Dataset (released in this repository):

ACE-whQA: the corpus is in SQuAD 2.0 format, so it can be used with the same code.
- Has Answer: DATA/ACE-whQA/ACE-whQA-has-answer.json
- Compet. IDK: DATA/ACE-whQA/ACE-whQA-IDK-competitive.json
- Non-Compet. IDK: DATA/ACE-whQA/ACE-wkQA-non-competitive.json
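
Because the corpus follows the SQuAD 2.0 format, a quick sanity check is to count how many questions each file marks as answerable or unanswerable. The sketch below is only an illustration: it assumes the files use the SQuAD 2.0 `is_impossible` field, and the helper name `count_flag` is made up for this example.

```shell
# Count questions with a given "is_impossible" value in a SQuAD 2.0-style file.
# Assumption: the file marks unanswerable questions with "is_impossible": true,
# as SQuAD 2.0 does. count_flag is a hypothetical helper, not part of the repo.
count_flag() {
  # $1 = JSON file, $2 = true or false
  grep -oE "\"is_impossible\"[[:space:]]*:[[:space:]]*$2" "$1" | wc -l | tr -d ' '
}

# Report the counts for every JSON file found in the dataset directory.
for f in DATA/ACE-whQA/*.json; do
  [ -e "$f" ] || continue   # skip when the glob matches nothing
  echo "$f: answerable=$(count_flag "$f" false) unanswerable=$(count_flag "$f" true)"
done
```

The text match is a shortcut that avoids a JSON parser; it counts occurrences of the field, so it is only reliable when the string does not also appear inside a passage or question.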

License: The dataset is released under the Creative Commons Share-Alike 3.0 license

2. Pretrained Models

Extractive QA:

3. Commands for Training and Testing on SQuAD 2.0 and MNLI:

Setting: TensorFlow on Google Cloud with a single TPU (v2-8)

Creating a virtual environment:

              pip install virtualenv    # install virtualenv

              mkdir IDK-Beyond-SQuAD2.0-experiments

              cd IDK-Beyond-SQuAD2.0-experiments

              virtualenv venv           # create environment

              source venv/bin/activate  # activate environment

Installing requirements:

                pip install tensorflow==1.15
                
                pip install bert-tensorflow

                pip install --upgrade google-api-python-client
               
                pip install --upgrade oauth2client

Downloading BERT LARGE CASED model: https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip

The .zip file contains three items:

            A TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained weights (which is actually 3 files).

            A vocab file (vocab.txt) to map WordPiece to word id.

            A config file (bert_config.json) which specifies the hyperparameters of the model.

It should be unzipped to some directory $BERT_MODEL.
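
After unzipping, it can help to confirm that the three items listed above are actually present before starting a run. The following is a minimal sketch: `check_bert_model` is a hypothetical helper, the unzip location is an assumption, and it checks only `bert_model.ckpt.index` as a representative of the three checkpoint files.

```shell
# Assumption: the archive sits in the current directory and unpacks into a
# folder named cased_L-24_H-1024_A-16. Uncomment to unpack and set the path:
# unzip cased_L-24_H-1024_A-16.zip
# export BERT_MODEL="$PWD/cased_L-24_H-1024_A-16"

# check_bert_model is a hypothetical helper, not part of the BERT repository.
# It returns 0 only if a checkpoint file, the vocab file, and the config file
# are all present in the directory given as $1.
check_bert_model() {
  bert_dir="$1"
  missing=0
  for f in bert_model.ckpt.index vocab.txt bert_config.json; do
    if [ ! -f "$bert_dir/$f" ]; then
      echo "missing: $bert_dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call would be `check_bert_model "$BERT_MODEL" && echo "model directory looks complete"`.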

Code: https://github.com/google-research/bert
- MNLI: https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks
- SQuAD 2.0: https://github.com/google-research/bert#squad-20
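
As an illustration of how the pieces above fit together, the sketch below prints a `run_squad.py` command for SQuAD 2.0 instead of running it, so the paths can be checked first. The flags are those of the script in google-research/bert; every path, the bucket name, and the TPU name are hypothetical placeholders, and `squad2_cmd` is a helper made up for this example. Training hyperparameters are left at the script's defaults, which may differ from the ones used in the paper.

```shell
# Hypothetical placeholders; replace them with your own locations.
BERT_MODEL="${BERT_MODEL:-/path/to/cased_L-24_H-1024_A-16}"
SQUAD_DIR="${SQUAD_DIR:-/path/to/squad2.0}"
OUTPUT_DIR="${OUTPUT_DIR:-gs://your-bucket/squad2_output}"
TPU_NAME="${TPU_NAME:-your-tpu-name}"

# squad2_cmd is a hypothetical helper that only prints the command.
# $1 = file to predict on (any JSON file in SQuAD 2.0 format)
squad2_cmd() {
  echo python run_squad.py \
    --vocab_file="$BERT_MODEL/vocab.txt" \
    --bert_config_file="$BERT_MODEL/bert_config.json" \
    --init_checkpoint="$BERT_MODEL/bert_model.ckpt" \
    --do_lower_case=False \
    --do_train=True \
    --train_file="$SQUAD_DIR/train-v2.0.json" \
    --do_predict=True \
    --predict_file="$1" \
    --output_dir="$OUTPUT_DIR" \
    --use_tpu=True \
    --tpu_name="$TPU_NAME" \
    --version_2_with_negative=True
}

# Dry run: show the command for the SQuAD 2.0 dev set.
squad2_cmd "$SQUAD_DIR/dev-v2.0.json"
```

`--do_lower_case=False` is included because the model downloaded above is cased, and `--version_2_with_negative=True` enables the handling of unanswerable questions. Passing one of the ACE-whQA files as the argument would point the prediction step at the new dataset instead.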
