UnsupervisedQA

Code, data and models supporting the experiments in the ACL 2019 paper: Unsupervised Question Answering by Cloze Translation.

Obtaining training data for Question Answering (QA) is time-consuming and resource-intensive, and existing QA datasets are only available for limited domains and languages. In this work, we take some of the first steps towards unsupervised QA, and develop an approach that, without using the SQuAD training data at all, achieves 56.4 F1 on SQuAD v1.1, and 64.5 F1 when the answer is a named entity mention.

This repository provides code to run pre-trained models that generate synthetic question answering data. We also make a very large synthetic training dataset for extractive question answering available.

Dataset Downloads

We make available a dataset of 4 million SQuAD-like question answering datapoints, automatically generated by the unsupervised system described in the paper.

The data can be downloaded here. The data is in the SQuAD v1 format, and contains:

Fold	# Paragraphs	# QA pairs
`unsupervised_qa_train.json`	782,556	3,915,498
`unsupervised_qa_dev.json`	1,000	4,795
`unsupervised_qa_test.json`	1,000	4,804
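As a quick sanity check after downloading, the paragraph and QA pair counts in the table can be recomputed from the files. The following is a minimal sketch that assumes only the standard SQuAD v1 layout (`data` -> `paragraphs` -> `qas`); the helper names and the tiny in-memory example are illustrative and not part of this repository.

```python
import json


def count_squad(dataset):
    """Count paragraphs and QA pairs in a SQuAD v1-style dict."""
    n_paragraphs = 0
    n_qa_pairs = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            n_paragraphs += 1
            n_qa_pairs += len(paragraph["qas"])
    return n_paragraphs, n_qa_pairs


def count_squad_file(path):
    """Load a SQuAD v1-style JSON file from disk and count its contents."""
    with open(path, encoding="utf-8") as f:
        return count_squad(json.load(f))


# Tiny invented example in the SQuAD v1 layout (not taken from the dataset).
example = {
    "version": "1.1",
    "data": [
        {
            "title": "example",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of France?",
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        },
                        {
                            "id": "q2",
                            "question": "Paris is the capital of which country?",
                            "answers": [{"text": "France", "answer_start": 24}],
                        },
                    ],
                }
            ],
        }
    ],
}

n_paragraphs, n_qa_pairs = count_squad(example)
print(n_paragraphs, n_qa_pairs)
```

Calling `count_squad_file("unsupervised_qa_dev.json")` on the downloaded dev fold should then report the counts listed in the table, if the file follows the layout assumed here.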

Fine-tuning BERT-Large for reading comprehension on this training data achieves over 50.0 F1 on the SQuAD v1.1 development set, provided an appropriate early stopping strategy is applied on the unsupervised_qa dev set.

Models and Code

In addition to the above data, this repository provides functionality to generate synthetic training data from user-provided documents.

Installation:

The code is built to run on top of UnsupervisedMT, and requires all of its dependencies. Additional requirements are spaCy (for NER and noun chunking), attrs, and NLTK and allennlp (for constituency parsing). It was developed to run on Ubuntu Linux 18.04 and Python 3.7, with CUDA 9.

(Optionally) Create a conda environment to keep things clean:

conda create -n uqa37 python=3.7 && conda activate uqa37

The recommended way to install is shown below, which should install and handle all dependencies:

# clone the repo
git clone https://github.com/facebookresearch/UnsupervisedQA.git
cd UnsupervisedQA

# install python dependencies:
pip install -r requirements.txt

# install UnsupervisedMT and its dependencies
./install_tools.sh

Models:

Four UNMT models are made available for download:

Sentence Cloze boundaries, Noun Phrase Answers
Sentence Cloze boundaries, Named Entity Answers
Sub-clause Cloze boundaries, Named Entity Answers
Sub-clause Cloze boundaries, Named Entity Answers, Wh Heuristics (best downstream performance)

The models can be downloaded using the script:

./download_models.sh

This will download all the models and unzip them to the appropriate directory. Each unzipped model is about 850MB, so the total space requirement is about 3.5GB.

Usage:

You can generate reading comprehension training data using unsupervisedqa.generate_synthetic_qa_data.

This script generates unsupervised question answering data using the identity, noisy cloze or unsupervised NMT methods explored in the paper. It also lets you choose between several configurations (i.e. whether to use subclause shortening, named entity answers and the wh heuristic).

This script provides the following command line arguments:

usage: generate_synthetic_qa_data.py [-h] [--input_file_format {txt,jsonl}]
                                     [--output_file_formats OUTPUT_FILE_FORMATS]
                                     [--translation_method {identity,noisy_cloze,unmt}]
                                     [--use_subclause_clozes]
                                     [--use_named_entity_clozes]
                                     [--use_wh_heuristic]
                                     input_file output_file

Generate synthetic training data for extractive QA tasks without supervision

positional arguments:
  input_file            input file, see readme for formatting info
  output_file           Path to write generated data to, see readme for
                        formatting info

optional arguments:
  -h, --help            show this help message and exit
  --input_file_format {txt,jsonl}
                        input file format, see readme for more info, default
                        is txt
  --output_file_formats OUTPUT_FILE_FORMATS
                        comma-seperated list of output file formats, from
                        [jsonl, squad], an output file will be created for
                        each format. Default is 'jsonl,squad'
  --translation_method {identity,noisy_cloze,unmt}
                        define the method to generate clozes -- either the
                        Unsupervised NMT method (unmt), or the identity or
                        noisy cloze baseline methods. UNMT is recommended for
                        downstream performance, but the noisy_cloze is
                        relatively stong on downstream QA and fast to
                        generate. Default is unmt
  --use_subclause_clozes
                        pass this flag to shorten clozes with constituency
                        parsing instead of using sentence boundaries
                        (recommended for downstream performance)
  --use_named_entity_clozes
                        pass this flag to use named entity answer prior
                        instead of noun phrases (recommended for downstream
                        performance)
  --use_wh_heuristic    pass this flag to use the wh-word heuristic
                        (recommended for downstream performance). Only
                        compatable with named entity clozes

The input format is specified by the --input_file_format argument. The input can either be a .txt file of paragraphs, one per line, from which questions and answers are generated, or a .jsonl file in which each line contains a JSON-serialised dict of the format {"text": text of paragraph, "paragraph_id": your unique identifier for the paragraph}.
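For the .jsonl input option, a file in the described shape can be produced with a few lines of Python. This is a minimal sketch: the helper names, the `para-N` identifier scheme and the sample paragraphs are assumptions made for illustration, and only the "text" and "paragraph_id" keys come from the format described above.

```python
import json


def to_jsonl_lines(paragraphs, id_prefix="para"):
    """Serialise each paragraph as one JSON object per line."""
    lines = []
    for i, text in enumerate(paragraphs):
        # "paragraph_id" only needs to be unique; this numbering is an assumption.
        record = {"text": text, "paragraph_id": f"{id_prefix}-{i}"}
        # json.dumps escapes embedded newlines, so each record stays on one line.
        lines.append(json.dumps(record))
    return lines


def write_jsonl(paragraphs, path):
    """Write the paragraphs to `path`, one JSON record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in to_jsonl_lines(paragraphs):
            f.write(line + "\n")


# Invented sample paragraphs, used only to show the resulting lines.
paragraphs = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "The Amazon river flows through Brazil.\nIt is very long.",
]
lines = to_jsonl_lines(paragraphs)
for line in lines:
    print(line)
```

The resulting file can then be passed as input_file together with --input_file_format jsonl.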

The output format is specified with the --output_file_formats argument, which accepts jsonl, squad, or both. Requesting the squad format outputs a file in the SQuAD v1.1 format, ready to be plugged into downstream extractive QA tasks. The jsonl format provides more metadata than the squad format; its fields are explained below:

{
    "cloze_id": unique identifier for this datapoint
    "paragraph": data on the paragraph this datapoint was generated from
    "source_text": the text from the paragraph the cloze was generated from
    "source_start": character index in paragraph where "source_text" starts
    "cloze_text": the text of the cloze question the question is generated from
    "answer_text": the answer text of the (cloze) question
    "answer_start": the character index that the answer starts at in the paragraph
    "constituency_parse": the constituency parse of the "source_text" if available, otherwise null,
    "root_label": the node label of the root of the constituency parse if available, otherwise null,
    "answer_type": The named entity label of the answer (if using named entity clozes) otherwise "NOUNPHRASE"
    "question_text": the text of the natural question, translated from "cloze_text"
}
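The extra metadata in the jsonl output makes it easy to inspect or filter the generated data. The sketch below reads jsonl records and summarises them by "answer_type"; it relies only on the "answer_type", "question_text" and "answer_text" fields listed above. The helper names and the sample records (including the label values) are invented for illustration, and real records carry all of the fields described above.

```python
import json
from collections import Counter


def read_jsonl(lines):
    """Parse an iterable of JSON lines, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def answer_type_counts(records):
    """Count how many datapoints were generated for each answer type."""
    return Counter(r["answer_type"] for r in records)


def qa_pairs(records, answer_type=None):
    """Return (question, answer) tuples, optionally for one answer type only."""
    return [
        (r["question_text"], r["answer_text"])
        for r in records
        if answer_type is None or r["answer_type"] == answer_type
    ]


# Invented records with a subset of the fields, for demonstration only.
sample_lines = [
    '{"cloze_id": "a", "answer_text": "Paris", "answer_type": "GPE", '
    '"question_text": "Where is the Louvre?"}',
    '{"cloze_id": "b", "answer_text": "1903", "answer_type": "DATE", '
    '"question_text": "When did Curie win the Nobel Prize?"}',
    '{"cloze_id": "c", "answer_text": "Brazil", "answer_type": "GPE", '
    '"question_text": "Where does the Amazon flow?"}',
]

records = read_jsonl(sample_lines)
counts = answer_type_counts(records)
gpe_pairs = qa_pairs(records, answer_type="GPE")
print(counts)
print(gpe_pairs)
```

To use this on real output, pass an open file handle for the generated jsonl file to `read_jsonl` instead of `sample_lines`.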

A working example that produces unsupervised NMT-translated questions using the model trained with wh heuristics, named entity answers and subclause shortening is below:

python -m unsupervisedqa.generate_synthetic_qa_data example_input.txt example_output \
    --input_file_format "txt" \
    --output_file_formats "jsonl,squad" \
    --translation_method unmt \
    --use_named_entity_clozes \
    --use_subclause_clozes \
    --use_wh_heuristic
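When generating data for many input files, it can be convenient to assemble this command from Python. The following is a sketch under stated assumptions: `build_command` is a hypothetical helper, not part of this repository, and it uses only the flags from the help text above. It also checks up front the documented constraint that the wh heuristic is only compatible with named entity clozes; whether the script itself enforces this is not stated here.

```python
import sys


def build_command(
    input_file,
    output_file,
    input_file_format="txt",
    output_file_formats="jsonl,squad",
    translation_method="unmt",
    use_subclause_clozes=False,
    use_named_entity_clozes=False,
    use_wh_heuristic=False,
):
    """Build the argument list for unsupervisedqa.generate_synthetic_qa_data."""
    if use_wh_heuristic and not use_named_entity_clozes:
        # Documented in the help text: wh heuristic needs named entity clozes.
        raise ValueError("--use_wh_heuristic requires --use_named_entity_clozes")
    cmd = [
        sys.executable, "-m", "unsupervisedqa.generate_synthetic_qa_data",
        input_file, output_file,
        "--input_file_format", input_file_format,
        "--output_file_formats", output_file_formats,
        "--translation_method", translation_method,
    ]
    if use_subclause_clozes:
        cmd.append("--use_subclause_clozes")
    if use_named_entity_clozes:
        cmd.append("--use_named_entity_clozes")
    if use_wh_heuristic:
        cmd.append("--use_wh_heuristic")
    return cmd


# Same configuration as the shell example above.
cmd = build_command(
    "example_input.txt",
    "example_output",
    use_subclause_clozes=True,
    use_named_entity_clozes=True,
    use_wh_heuristic=True,
)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the generation.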

I'm running out of GPU memory

The repository requires a CUDA-enabled GPU (this is a requirement of UnsupervisedMT), but you can reduce the amount of GPU memory required by lowering the batch sizes. To do so, edit CONSTITUENCY_BATCH_SIZE and UNMT_BATCH_SIZE in the unsupervisedqa/configs.py file.

Training your own question translation models

This repository only provides functionality to run the pre-trained unsupervised question translation models from the paper. Users who want to train new question translation models should use the training functionality in UnsupervisedMT, or consider the newer and more powerful XLM repository.

To train question translation models in UnsupervisedMT, first prepare a large corpus of cloze questions (potentially using the functionality in this repository) and a large corpus of natural questions. Preprocess these corpora by adapting UnsupervisedMT/NMT/get_data_enfr.sh, and train using the example script in UnsupervisedMT/README, with appropriate edits to the args (e.g. en -> cloze and fr -> question) and paths.

References

Please cite [1] and [2] if you find the resources in this repository useful.

Unsupervised Question Answering by Cloze Translation

[1] P. Lewis, L. Denoyer, S. Riedel. Unsupervised Question Answering by Cloze Translation. ACL 2019.

@inproceedings{lewis2019unsupervisedqa,
  title={Unsupervised Question Answering by Cloze Translation},
  author={Lewis, Patrick and Denoyer, Ludovic and Riedel, Sebastian},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year={2019}
}

Phrase-Based & Neural Unsupervised Machine Translation

[2] G. Lample, M. Ott, A. Conneau, L. Denoyer, MA. Ranzato. Phrase-Based & Neural Unsupervised Machine Translation. EMNLP 2018.

@inproceedings{lample2018phrase,
  title={Phrase-Based \& Neural Unsupervised Machine Translation},
  author={Lample, Guillaume and Ott, Myle and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}

License

See the LICENSE file for more details.

Troubleshooting

If you run into problems installing dependencies (particularly allennlp), installing libffi may help:

apt-get install libffi6 libffi-dev
