Contextualized Sparse Representations for Real-Time Open-Domain Question Answering
This repository provides the authors' implementation of Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. You can train and evaluate DenSPI+Sparc as described in our paper and create your own Sparc vectors.
Please install the Conda environment as follows:
$ conda env create -f environment.yml
$ conda activate sparc
Note that this repository is mostly based on DenSPI and DrQA.
We use SQuAD v1.1 for training DenSPI+Sparc. Please download the train and dev files into $DATA_DIR.
$ mkdir $DATA_DIR
$ wget https://raw.githubusercontent.com/rajpurkar/SQuAD-explorer/master/dataset/train-v1.1.json -O $DATA_DIR/train-v1.1.json
$ wget https://raw.githubusercontent.com/rajpurkar/SQuAD-explorer/master/dataset/dev-v1.1.json -O $DATA_DIR/dev-v1.1.json
DenSPI is based on BERT. Please download the pre-trained BERT weights into $BERT_DIR.
$ mkdir $BERT_DIR
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin -O $BERT_DIR/pytorch_model_base_uncased.bin
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json -O $BERT_DIR/bert_config_base_uncased.json
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-pytorch_model.bin -O $BERT_DIR/pytorch_model_large_uncased.bin
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-config.json -O $BERT_DIR/bert_config_large_uncased.json
# Vocabulary is the same for BERT-base and BERT-large.
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt -O $BERT_DIR/vocab.txt
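Before training, it can help to confirm that the downloads landed where the commands above put them. The following is a minimal sketch, assuming the file names used in the wget commands; the helper `missing_files` is hypothetical and not part of this repository.

```python
import os

# File names taken from the wget commands above.
SQUAD_FILES = ["train-v1.1.json", "dev-v1.1.json"]
BERT_FILES = [
    "pytorch_model_base_uncased.bin",
    "bert_config_base_uncased.json",
    "pytorch_model_large_uncased.bin",
    "bert_config_large_uncased.json",
    "vocab.txt",
]


def missing_files(directory, names):
    """Return the names that do not exist as regular files under directory."""
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]


if __name__ == "__main__":
    # Falls back to the current directory if the variables are not exported.
    checks = [
        ("DATA_DIR", os.environ.get("DATA_DIR", "."), SQUAD_FILES),
        ("BERT_DIR", os.environ.get("BERT_DIR", "."), BERT_FILES),
    ]
    for label, directory, names in checks:
        missing = missing_files(directory, names)
        print(label, "OK" if not missing else "missing: " + ", ".join(missing))
```

If you only plan to train with BERT-base, the two `large_uncased` entries can be dropped from `BERT_FILES`.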
To train DenSPI+Sparc on SQuAD, use train.py. Trained models will be saved in $OUT_DIR1.
$ mkdir $OUT_DIR1
# Train with BERT-base
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR1 --bert_model_option 'base_uncased' --train_file train-v1.1.json --predict_file dev-v1.1.json --do_train --do_predict --do_eval
# Train with BERT-large (use smaller train_batch_size for 12GB GPUs)
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR1 --bert_model_option 'large_uncased' --parallel --train_file train-v1.1.json --predict_file dev-v1.1.json --do_train --do_predict --do_eval --train_batch_size 6
The result will look like the following (in the case of BERT-base):
04/28/2020 06:32:59 - INFO - post - num vecs=45059736, num_words=1783576, nvpw=25.2637
04/28/2020 06:33:01 - INFO - __main__ - [Validation] loss: 8.700, b'{"exact_match": 75.10879848628193, "f1": 83.42143097917004}\n'
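If you want to track these numbers across runs, the validation line can be parsed with a few lines of Python. This is a sketch that assumes the log line keeps exactly the shape shown above; `parse_validation` is a hypothetical helper, not a function in this repository.

```python
import json
import re

# The validation log line shown above, copied verbatim (raw string keeps "\n" literal).
LINE = r"""04/28/2020 06:33:01 - INFO - __main__ - [Validation] loss: 8.700, b'{"exact_match": 75.10879848628193, "f1": 83.42143097917004}\n'"""

# Captures the loss value and the JSON object embedded in the bytes literal.
PATTERN = re.compile(r"\[Validation\] loss: ([0-9.]+), b'(\{.*?\})")


def parse_validation(line):
    """Return a dict with loss, exact_match and f1, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    metrics = json.loads(match.group(2))
    metrics["loss"] = float(match.group(1))
    return metrics


if __name__ == "__main__":
    print(parse_validation(LINE))
```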
To use DenSPI+Sparc in an open-domain setting, you have to additionally train it with negative samples. For DenSPI+Sparc with BERT-base (BERT-large is the same except for the --bert_model_option and --parallel arguments), the commands for training on negative samples are:
$ mkdir $OUT_DIR2
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR2 --bert_model_option 'base_uncased' --train_file train-v1.1.json --predict_file dev-v1.1.json --do_train_neg --do_predict --do_eval --do_load --load_dir $OUT_DIR1 --load_epoch 3
Finally, train the phrase classifier as follows:
$ mkdir $OUT_DIR3
# Train only 1 epoch for phrase classifier
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR3 --bert_model_option 'base_uncased' --train_file train-v1.1.json --predict_file dev-v1.1.json --num_train_epochs 1 --do_train_filter --do_predict --do_eval --do_load --load_dir $OUT_DIR2 --load_epoch 3
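The three training stages differ only in a handful of flags, so they can be assembled from one shared prefix. The sketch below covers the BERT-base case only and uses only flags that appear in the commands above. It assumes that each stage writes to its own output directory (as the mkdir commands suggest) and that each later stage loads epoch 3 of the previous one. `stage_commands` is a hypothetical helper that only builds the command lines and does not run them.

```python
import shlex


def stage_commands(data_dir, bert_dir, out1, out2, out3, model="base_uncased"):
    """Return the three train.py invocations as argument lists (BERT-base settings)."""
    common = [
        "python", "train.py",
        "--data_dir", data_dir,
        "--metadata_dir", bert_dir,
        "--bert_model_option", model,
        "--train_file", "train-v1.1.json",
        "--predict_file", "dev-v1.1.json",
    ]
    evaluate = ["--do_predict", "--do_eval"]

    def load(previous_dir):
        # Resume from epoch 3 of the previous stage, as in the commands above.
        return ["--do_load", "--load_dir", previous_dir, "--load_epoch", "3"]

    return [
        # Stage 1: train on SQuAD.
        common + ["--output_dir", out1, "--do_train"] + evaluate,
        # Stage 2: train with negative samples, starting from stage 1.
        common + ["--output_dir", out2, "--do_train_neg"] + evaluate + load(out1),
        # Stage 3: train the phrase classifier for one epoch, starting from stage 2.
        common + ["--output_dir", out3, "--num_train_epochs", "1", "--do_train_filter"]
        + evaluate + load(out2),
    ]


if __name__ == "__main__":
    for cmd in stage_commands("$DATA_DIR", "$BERT_DIR", "$OUT_DIR1", "$OUT_DIR2", "$OUT_DIR3"):
        # Printed without quoting so the shell variables stay readable.
        print(" ".join(cmd))
        # For literal paths, quote each argument before pasting into a shell:
        # print(" ".join(shlex.quote(arg) for arg in cmd))
```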
We also provide a pretrained DenSPI+Sparc as follows:
- DenSPI+Sparc pre-trained on SQuAD - link
Given the pre-trained DenSPI+Sparc, you can obtain Sparc embeddings with the following commands. The example below assumes that you are using our pre-trained weights (denspi_sparc.zip unzipped in the denspi_sparc folder). If you want to use your own model, please modify MODEL_DIR accordingly.
Put each text you want to embed on its own line of input_examples.txt. Then run:
$ export DATA_DIR=.
$ export MODEL_DIR=denspi_sparc
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR --predict_file input_examples.txt --parallel --bert_model_option 'large_uncased' --do_load --load_dir $MODEL_DIR --load_epoch 1 --do_embed --dump_file output.json
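The input file for this step can also be generated from Python. The sketch below follows the one-text-per-line convention described above; `write_examples` is a hypothetical helper, and collapsing whitespace is an assumption made here so that a text containing line breaks is not split into several inputs.

```python
def write_examples(path, texts):
    """Write one text per line; whitespace runs (including line breaks) become single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(" ".join(text.split()) + "\n")


if __name__ == "__main__":
    write_examples(
        "input_examples.txt",
        [
            "Between 1991 and 2000, the total area of forest lost in the Amazon "
            "rose from 415,000 to 587,000 square kilometres.",
        ],
    )
```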
The result file $OUT_DIR/output.json will show the Sparc embedding of the input text ([CLS] representation, sorted by score). For instance:
{
"out": [
{
"text": "They defeated the Arizona Cardinals 49-15 in the NFC Championship Game and advanced to their second Super Bowl appearance since the franchise was founded in 1995.",
"sparc": {
"uni": {
"1995": {
"score": 1.6841894388198853,
"vocab": "2786"
},
"second": {
"score": 1.6321970224380493,
"vocab": "2117"
},
"49": {
"score": 1.6075607538223267,
"vocab": "4749"
},
"arizona": {
"score": 1.1734912395477295,
"vocab": "5334"
},
},
"bi": {
"arizona cardinals": {
"score": 1.3190401792526245,
"vocab": "5334, 9310"
},
"nfc championship": {
"score": 1.1005975008010864,
"vocab": "22309, 2528"
},
"49 -": {
"score": 1.0863999128341675,
"vocab": "4749, 1011"
},
"the arizona": {
"score": 0.9722453951835632,
"vocab": "1996, 5334"
},
}
}
}
]
}
Note that each text is segmented by the BERT tokenizer ("vocab" denotes the BERT vocab index).
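To work with the dump programmatically, the structure shown above can be read with the standard json module. This is a minimal sketch that assumes the file has exactly the layout of the example ("out" list, "sparc" with "uni" and "bi" maps, "vocab" as a comma-separated string). The helpers `top_ngrams`, `vocab_ids`, and `load_vocab` are hypothetical, and `load_vocab` relies on the BERT vocab.txt convention of one token per line, with the index being the 0-based line number.

```python
import json


def top_ngrams(entry, kind="uni", k=3):
    """Return the k highest-scoring (ngram, score) pairs of one entry of "out"."""
    items = entry["sparc"][kind].items()
    ranked = sorted(items, key=lambda kv: kv[1]["score"], reverse=True)
    return [(ngram, info["score"]) for ngram, info in ranked[:k]]


def vocab_ids(info):
    """Parse a "vocab" string such as "5334, 9310" into a list of ints."""
    return [int(part) for part in info["vocab"].split(",")]


def load_vocab(path):
    """Read vocab.txt into a list so that vocab[index] is the token string."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# A shortened copy of the example output above, used here instead of a real file.
SAMPLE = {
    "out": [
        {
            "text": "They defeated the Arizona Cardinals 49-15 ...",
            "sparc": {
                "uni": {
                    "1995": {"score": 1.6841894388198853, "vocab": "2786"},
                    "second": {"score": 1.6321970224380493, "vocab": "2117"},
                },
                "bi": {
                    "arizona cardinals": {"score": 1.3190401792526245, "vocab": "5334, 9310"},
                },
            },
        }
    ]
}

if __name__ == "__main__":
    # With a real dump: SAMPLE = json.load(open("output.json", encoding="utf-8"))
    entry = SAMPLE["out"][0]
    print(top_ngrams(entry, "uni", k=2))
    print(vocab_ids(entry["sparc"]["bi"]["arizona cardinals"]))
```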
To see how Sparc changes for each phrase, set start_index in here to the target token position. For instance, setting start_index = 17 to embed Sparc of 415,000 in the following text gives (some n-grams are omitted):
"text": "Between 1991 and 2000, the total area of forest lost in the Amazon rose from 415,000 to 587,000 square kilometres.",
"sparc": {
"uni": {
"1991": {
"score": 1.182684063911438,
"vocab": "2889"
},
"2000": {
"score": 0.41507360339164734,
"vocab": "2456"
},
whereas setting start_index = 21 to embed Sparc of 587,000 gives:
"text": "Between 1991 and 2000, the total area of forest lost in the Amazon rose from 415,000 to 587,000 square kilometres.",
"sparc": {
"uni": {
"2000": {
"score": 1.1923936605453491,
"vocab": "2456"
},
"1991": {
"score": 0.7090237140655518,
"vocab": "2889"
},
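The contrast between the two (truncated) outputs above can be made explicit by subtracting the unigram scores. The sketch below uses only the four scores shown above; `score_shift` is a hypothetical helper. A positive difference means the n-gram scores higher for the phrase at start_index = 21 than for the phrase at start_index = 17.

```python
def score_shift(before, after):
    """Return after - before for every n-gram present in both score maps."""
    return {ngram: after[ngram] - before[ngram] for ngram in before if ngram in after}


# Unigram scores copied from the two outputs above.
uni_17 = {"1991": 1.182684063911438, "2000": 0.41507360339164734}  # phrase: 415,000
uni_21 = {"2000": 1.1923936605453491, "1991": 0.7090237140655518}  # phrase: 587,000

if __name__ == "__main__":
    for ngram, delta in sorted(score_shift(uni_17, uni_21).items()):
        print(f"{ngram}: {delta:+.3f}")
```

In this example, "2000" gains weight and "1991" loses weight when the target phrase moves from 415,000 to 587,000, which matches the pairing of years and figures in the sentence.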
For now, please see the original DenSPI repository or the recent application of DenSPI in the COVID-19 domain for building a phrase index with DenSPI+Sparc.
The main changes in phrase indexing are in post.py and mips_phrase.py, where Sparc is used for open-domain QA inference (see here).
@inproceedings{lee2020contextualized,
title={Contextualized Sparse Representations for Real-Time Open-Domain Question Answering},
author={Lee, Jinhyuk and Seo, Minjoon and Hajishirzi, Hannaneh and Kang, Jaewoo},
booktitle={ACL},
year={2020}
}