This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection", to appear at ACL 2021.
Our pretraining code is based on TensorFlow (tested with 1.15), while fine-tuning is based on PyTorch (1.7.1) and Transformers (2.9.0). Note that each has its own requirements file: pretraining/requirements.txt and finetuning/requirements.txt.
curl -L https://www.dropbox.com/sh/pfg8j6yfpjltwdx/AAC8Oky0w8ZS-S3S5zSSAuQma?dl=1 > mrqa-few-shot.zip
unzip mrqa-few-shot.zip -d mrqa-few-shot
curl -L https://www.dropbox.com/sh/h63xx2l2fjq8bsz/AAC5_Z_F2zBkJgX87i3IlvGca?dl=1 > splinter.zip
unzip splinter.zip -d splinter
Create a virtual environment and execute
cd pretraining
pip install -r requirements.txt # or requirements-gpu.txt for a GPU version
Then download the raw data (our pretraining was based on Wikipedia and BookCorpus). We support two data formats:
- For Wikipedia, a <doc> tag starts a new article and a </doc> tag ends it.
- For BookCorpus, we process an already-tokenized file where tokens are separated by whitespace. A newline indicates a new book.
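For illustration, here is a minimal sketch of how these two formats could be iterated over; the function names and parsing details are our own assumptions, not the repository's preprocessing code.
def iter_wiki_articles(path):
    """Yield one article at a time from a file that uses <doc> ... </doc> markers."""
    article_lines = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("<doc"):      # a <doc> tag starts a new article
                article_lines = []
            elif line.startswith("</doc"):   # a </doc> tag ends it
                yield "".join(article_lines)
            else:
                article_lines.append(line)

def iter_bookcorpus_books(path):
    """Yield one whitespace-tokenized book per line (a newline indicates a new book)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if tokens:
                yield tokens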
The following command takes as input a set of files ($INPUT_PATTERN) and creates a tensorized dataset for pretraining.
It supports the following masking schemes:
- Masked Language Modeling (Devlin et al., 2019)
- Masked Language Modeling with Geometric Masking (SpanBERT; Joshi et al., 2020). See an example for creating the data for SpanBERT, and for pretraining it.
- Recurring Span Selection (our pretraining scheme)
cd pretraining
python create_pretraining_data.py \
--input_file=$INPUT_PATTERN \
--output_dir=$OUTPUT_DIR \
--vocab_file=vocabs/bert-cased-vocab.txt \
--do_lower_case=False \
--do_whole_word_mask=False \
--max_seq_length=512 \
--num_processes=63 \
--dupe_factor=5 \
--max_span_length=10 \
--recurring_span_selection=True \
--only_recurring_span_selection=True \
--max_questions_per_seq=30
N-gram statistics are written to ngrams.txt in the output directory.
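To build intuition for recurring span selection, here is a minimal sketch of finding recurring spans in a passage via exact n-gram matching. It is illustrative only: the function name, length bounds, and absence of filtering are assumptions; the full logic (filtering, sampling, and tensorization) lives in create_pretraining_data.py.
from collections import defaultdict

def find_recurring_spans(tokens, min_span_length=1, max_span_length=10):
    """Map each n-gram that occurs more than once in `tokens` to its start positions."""
    occurrences = defaultdict(list)
    for n in range(min_span_length, max_span_length + 1):
        for start in range(len(tokens) - n + 1):
            occurrences[tuple(tokens[start:start + n])].append(start)
    # In recurring span selection, all but one occurrence of a selected span are masked
    # (each replaced by a [QUESTION] token) and the model learns to select the remaining one.
    return {ngram: starts for ngram, starts in occurrences.items() if len(starts) > 1}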
cd pretraining
python run_pretraining.py \
--bert_config_file=configs/bert-base-cased-config.json \
--input_file=$INPUT_FILE \
--output_dir=$OUTPUT_DIR \
--max_seq_length=512 \
--recurring_span_selection=True \
--only_recurring_span_selection=True \
--max_questions_per_seq=30 \
--do_train \
--train_batch_size=256 \
--learning_rate=1e-4 \
--num_train_steps=2400000 \
--num_warmup_steps=10000 \
--save_checkpoints_steps=10000 \
--keep_checkpoint_max=240 \
--use_tpu \
--num_tpu_cores=8 \
--tpu_name=$TPU_NAME
This can be trained using GPUs by dropping the use_tpu flag (although it was tested mainly on TPUs).
In order to fine-tune the TF model you pretrained with run_pretraining.py, you will first need to convert it to PyTorch. You can do so by
cd model_conversion
pip install -r requirements.txt
python convert_tf_to_pytorch.py --tf_checkpoint_path $TF_MODEL_PATH --pytorch_dump_path $OUTPUT_PATH
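As an optional sanity check of the converted checkpoint, you can inspect the dumped weights. The sketch below assumes the script saves a plain state dict (as standard HuggingFace conversion scripts do); the path is hypothetical and should point to your $OUTPUT_PATH.
import torch

# Load the converted weights on CPU and list a few parameter names/shapes.
state_dict = torch.load("splinter/pytorch_model.bin", map_location="cpu")  # hypothetical path
print(f"{len(state_dict)} tensors in the converted checkpoint")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))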
Fine-tuning has different requirements than pretraining, as it uses HuggingFace's Transformers library. Create a virtual environment and execute
cd finetuning
pip install -r requirements.txt
Please note: if you want to reproduce results from the paper or run with a QASS head in general, questions need to be augmented with a [QUESTION] token. In order to do so, please run
cd finetuning
python qass_preprocess.py --path "../mrqa-few-shot/*/*.jsonl"
This will add a [MASK] token to each question in the training data, which will later be replaced by a [QUESTION] token automatically by the QASS layer implementation.
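For illustration only, the core idea of this step can be sketched as follows; the field names assume the MRQA jsonl schema and the token placement is our guess, so this is a sketch rather than a drop-in replacement for qass_preprocess.py.
import glob
import json

# Append a [MASK] token to every question; outputs get a _qass suffix,
# matching the file names used in the fine-tuning command below.
for path in glob.glob("../mrqa-few-shot/*/*.jsonl"):
    if path.endswith("_qass.jsonl"):
        continue  # skip files produced by a previous run
    out_path = path.replace(".jsonl", "_qass.jsonl")
    with open(path, encoding="utf-8") as f_in, open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            example = json.loads(line)
            for qa in example.get("qas", []):   # "qas"/"question" assume the MRQA schema
                qa["question"] = qa["question"].rstrip() + " [MASK]"
            f_out.write(json.dumps(example) + "\n")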
Then fine-tune Splinter by
cd finetuning
export MODEL="../splinter"
export OUTPUT_DIR="output"
python run_mrqa.py \
--model_type=bert \
--model_name_or_path=$MODEL \
--qass_head=True \
--tokenizer_name=$MODEL \
--output_dir=$OUTPUT_DIR \
--train_file="../mrqa-few-shot/squad/squad-train-seed-42-num-examples-16_qass.jsonl" \
--predict_file="../mrqa-few-shot/squad/dev_qass.jsonl" \
--do_train \
--do_eval \
--max_seq_length=384 \
--doc_stride=128 \
--threads=4 \
--save_steps=50000 \
--per_gpu_train_batch_size=12 \
--per_gpu_eval_batch_size=16 \
--learning_rate=3e-5 \
--max_answer_length=10 \
--warmup_ratio=0.1 \
--min_steps=200 \
--num_train_epochs=10 \
--seed=42 \
--use_cache=False \
--evaluate_every_epoch=False
In order to train with automatic mixed precision, install apex and add the --fp16 flag.
See an example script for fine-tuning SpanBERT (rather than Splinter) here.
If you find this work helpful, please cite us
@inproceedings{ram2021fewshot,
author = {Ori Ram and Yuval Kirstain and Jonathan Berant and Amir Globerson and Omer Levy},
booktitle = {Association for Computational Linguistics (ACL)},
title = {Few-Shot Question Answering by Pretraining Span Selection},
url = {https://arxiv.org/abs/2101.00438},
year = {2021},
}