sparc: A Python repository from jhyuklee

Sparc

Contextualized Sparse Representations for Real-Time Open-Domain Question Answering

This repository provides the authors' implementation of Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. You can train and evaluate DenSPI+Sparc as described in our paper, and compute your own Sparc vectors.

Environment

Please install the Conda environment as follows:

$ conda env create -f environment.yml
$ conda activate sparc

Note that this repository is mostly based on DenSPI and DrQA.

Resources

We use SQuAD v1.1 for training DenSPI+Sparc. Please download the training and development sets to $DATA_DIR.

$ mkdir $DATA_DIR
$ wget https://raw.githubusercontent.com/rajpurkar/SQuAD-explorer/master/dataset/train-v1.1.json -O $DATA_DIR/train-v1.1.json
$ wget https://raw.githubusercontent.com/rajpurkar/SQuAD-explorer/master/dataset/dev-v1.1.json -O $DATA_DIR/dev-v1.1.json
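
As a quick sanity check after downloading, the files can be inspected with a few lines of Python. The sketch below is not part of this repository; it only assumes the public SQuAD v1.1 layout (data, paragraphs, qas), and the helper name count_squad_examples, the sample title, id, and question are made up for illustration.

```python
import json
import os
import tempfile


def count_squad_examples(path):
    """Return (articles, paragraphs, questions) for a SQuAD v1.1 style JSON file."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    articles = dataset["data"]
    paragraphs = [p for article in articles for p in article["paragraphs"]]
    questions = [qa for p in paragraphs for qa in p["qas"]]
    return len(articles), len(paragraphs), len(questions)


# Tiny stand-in written in the SQuAD v1.1 layout so the sketch runs without
# the real download; the title, id, and question are made up for illustration.
context = ("Between 1991 and 2000, the total area of forest lost in the Amazon "
           "rose from 415,000 to 587,000 square kilometres.")
answer = "587,000"
sample = {
    "version": "1.1",
    "data": [{
        "title": "Amazon_rainforest",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "example-0",
                "question": "How much forest had been lost by 2000?",
                "answers": [{"text": answer, "answer_start": context.index(answer)}],
            }],
        }],
    }],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample-v1.1.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    counts = count_squad_examples(sample_path)

print("articles=%d paragraphs=%d questions=%d" % counts)
```

To check the real data, call count_squad_examples on $DATA_DIR/train-v1.1.json and $DATA_DIR/dev-v1.1.json instead of the stand-in file.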

DenSPI is based on BERT. Please download the pre-trained BERT weights to $BERT_DIR.

$ mkdir $BERT_DIR
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin -O $BERT_DIR/pytorch_model_base_uncased.bin
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json -O $BERT_DIR/bert_config_base_uncased.json
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-pytorch_model.bin -O $BERT_DIR/pytorch_model_large_uncased.bin
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-config.json -O $BERT_DIR/bert_config_large_uncased.json
# Vocabulary is the same for BERT-base and BERT-large.
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt -O $BERT_DIR/vocab.txt

Model

To train DenSPI+Sparc on SQuAD, use train.py. Trained models will be saved in $OUT_DIR1.

$ mkdir $OUT_DIR1
# Train with BERT-base
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR1 --bert_model_option 'base_uncased' --train_file train-v1.1.json --predict_file dev-v1.1.json --do_train --do_predict --do_eval
# Train with BERT-large (use smaller train_batch_size for 12GB GPUs)
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR1 --bert_model_option 'large_uncased' --parallel --train_file train-v1.1.json --predict_file dev-v1.1.json --do_train --do_predict --do_eval --train_batch_size 6

The result will look like the following (for BERT-base):

04/28/2020 06:32:59 - INFO - post -   num vecs=45059736, num_words=1783576, nvpw=25.2637
04/28/2020 06:33:01 - INFO - __main__ -   [Validation] loss: 8.700, b'{"exact_match": 75.10879848628193, "f1": 83.42143097917004}\n'

To use DenSPI+Sparc in an open-domain setting, you have to additionally train it with negative samples. For DenSPI+Sparc with BERT-base, the command for training on negative samples is shown below (BERT-large is the same except for the --bert_model_option and --parallel arguments):

$ mkdir $OUT_DIR2
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR2 --bert_model_option 'base_uncased' --train_file train-v1.1.json --predict_file dev-v1.1.json --do_train_neg --do_predict --do_eval --do_load --load_dir $OUT_DIR1 --load_epoch 3

Finally, train the phrase classifier as follows:

$ mkdir $OUT_DIR3
# Train only 1 epoch for phrase classifier
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR3 --bert_model_option 'base_uncased' --train_file train-v1.1.json --predict_file dev-v1.1.json --num_train_epochs 1 --do_train_filter --do_predict --do_eval --do_load --load_dir $OUT_DIR2 --load_epoch 3

We also provide a pretrained DenSPI+Sparc as follows:

DenSPI+Sparc pre-trained on SQuAD - link

Sparc Embedding

Given the pre-trained DenSPI+Sparc, you can get Sparc embeddings with the following commands. The example below assumes you are using our pre-trained weights (denspi_sparc.zip unzipped into the denspi_sparc folder). If you want to use your own model, please modify MODEL_DIR accordingly.

Put each text you want to embed on its own line of input_examples.txt. Then run:

$ export DATA_DIR=.
$ export MODEL_DIR=denspi_sparc
$ python train.py --data_dir $DATA_DIR --metadata_dir $BERT_DIR --output_dir $OUT_DIR --predict_file input_examples.txt --parallel --bert_model_option 'large_uncased' --do_load --load_dir $MODEL_DIR --load_epoch 1 --do_embed --dump_file output.json
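
The input file used in the command above can be prepared by hand or with a short script. The following is a minimal sketch, not part of this repository: the helper write_input_examples is hypothetical, and collapsing internal whitespace is an assumption made here so that each text stays on a single line.

```python
import os
import tempfile


def write_input_examples(texts, path):
    """Write one text per line, collapsing internal whitespace and skipping blanks."""
    lines = [" ".join(text.split()) for text in texts]
    lines = [line for line in lines if line]
    with open(path, "w", encoding="utf-8") as f:
        f.write("".join(line + "\n" for line in lines))
    return len(lines)


texts = [
    "They defeated the Arizona Cardinals 49-15 in the NFC Championship Game "
    "and advanced to their second Super Bowl appearance since the franchise "
    "was founded in 1995.",
    # A text containing a line break, which must not survive into the file.
    "Between 1991 and 2000, the total area of forest lost\n"
    "in the Amazon rose from 415,000 to 587,000 square kilometres.",
    "   ",  # blank entries are dropped
]

# A temporary directory keeps the sketch side-effect free; in practice the
# path would be input_examples.txt inside $DATA_DIR.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "input_examples.txt")
    num_written = write_input_examples(texts, example_path)
    with open(example_path, encoding="utf-8") as f:
        written_lines = f.read().splitlines()

print("wrote %d lines" % num_written)
```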

The result file $OUT_DIR/output.json contains the Sparc embedding of each input text (the [CLS] representation, with n-grams sorted by score). For instance:

{
    "out": [
        {
            "text": "They defeated the Arizona Cardinals 49-15 in the NFC Championship Game and advanced to their second Super Bowl appearance since the franchise was founded in 1995.",
            "sparc": {
                "uni": {
                    "1995": {
                        "score": 1.6841894388198853,
                        "vocab": "2786"
                    },
                    "second": {
                        "score": 1.6321970224380493,
                        "vocab": "2117"
                    },
                    "49": {
                        "score": 1.6075607538223267,
                        "vocab": "4749"
                    },
                    "arizona": {
                        "score": 1.1734912395477295,
                        "vocab": "5334"
                    }
                },
                "bi": {
                    "arizona cardinals": {
                        "score": 1.3190401792526245,
                        "vocab": "5334, 9310"
                    },
                    "nfc championship": {
                        "score": 1.1005975008010864,
                        "vocab": "22309, 2528"
                    },
                    "49 -": {
                        "score": 1.0863999128341675,
                        "vocab": "4749, 1011"
                    },
                    "the arizona": {
                        "score": 0.9722453951835632,
                        "vocab": "1996, 5334"
                    }
                }
            }
        }
    ]
}

Note that each text is segmented by the BERT tokenizer ("vocab" denotes the BERT vocab index).
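
The dumped file can be explored with standard JSON tooling. The sketch below is illustrative only: it assumes the "out", "sparc", "uni", and "bi" keys shown in the example above, uses a trimmed inline copy of that example instead of a real dump, and the helper top_ngrams is hypothetical.

```python
import json


def top_ngrams(entry, kind="uni", k=3):
    """Return the k highest-scoring (n-gram, score) pairs for kind 'uni' or 'bi'."""
    scored = [(ngram, info["score"]) for ngram, info in entry["sparc"][kind].items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Trimmed copy of the example output above. For a real run, load the dumped
# file with json.load instead of parsing this inline string.
dump = json.loads("""
{
  "out": [
    {
      "text": "They defeated the Arizona Cardinals 49-15 in the NFC Championship Game and advanced to their second Super Bowl appearance since the franchise was founded in 1995.",
      "sparc": {
        "uni": {
          "1995": {"score": 1.6841894388198853, "vocab": "2786"},
          "second": {"score": 1.6321970224380493, "vocab": "2117"},
          "49": {"score": 1.6075607538223267, "vocab": "4749"}
        },
        "bi": {
          "arizona cardinals": {"score": 1.3190401792526245, "vocab": "5334, 9310"},
          "nfc championship": {"score": 1.1005975008010864, "vocab": "22309, 2528"}
        }
      }
    }
  ]
}
""")

for entry in dump["out"]:
    print(entry["text"][:40])
    for kind in ("uni", "bi"):
        for ngram, score in top_ngrams(entry, kind, k=2):
            print("  %s %-20s %.4f" % (kind, ngram, score))
```

Sorting explicitly by score keeps the helper independent of the key order in the file.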

To see how the Sparc embedding changes for each phrase, set start_index (in here) to the position of the target token. For instance, setting start_index = 17 to compute the Sparc embedding of 415,000 in the following text gives (some n-grams are omitted):

            "text": "Between 1991 and 2000, the total area of forest lost in the Amazon rose from 415,000 to 587,000 square kilometres.",
            "sparc": {
                "uni": {
                    "1991": {
                        "score": 1.182684063911438,
                        "vocab": "2889"
                    },
                    "2000": {
                        "score": 0.41507360339164734,
                        "vocab": "2456"
                    },

whereas setting start_index = 21 to compute the Sparc embedding of 587,000 gives:

            "text": "Between 1991 and 2000, the total area of forest lost in the Amazon rose from 415,000 to 587,000 square kilometres.",
            "sparc": {
                "uni": {
                    "2000": {
                        "score": 1.1923936605453491,
                        "vocab": "2456"
                    },
                    "1991": {
                        "score": 0.7090237140655518,
                        "vocab": "2889"
                    },
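
The contrast between the two positions can be made explicit by subtracting the scores. This is a small sketch over the unigram scores printed above, not code from this repository; the names at_17, at_21, and score_shift are hypothetical.

```python
def unigram_scores(uni):
    """Flatten a 'uni' block into a plain {n-gram: score} mapping."""
    return {ngram: info["score"] for ngram, info in uni.items()}


def score_shift(before, after):
    """Score change (after minus before) for n-grams present in both blocks."""
    a, b = unigram_scores(before), unigram_scores(after)
    return {ngram: b[ngram] - a[ngram] for ngram in a.keys() & b.keys()}


# Unigram blocks copied from the two outputs above.
at_17 = {  # start_index = 17, phrase "415,000"
    "1991": {"score": 1.182684063911438, "vocab": "2889"},
    "2000": {"score": 0.41507360339164734, "vocab": "2456"},
}
at_21 = {  # start_index = 21, phrase "587,000"
    "2000": {"score": 1.1923936605453491, "vocab": "2456"},
    "1991": {"score": 0.7090237140655518, "vocab": "2889"},
}

shift = score_shift(at_17, at_21)
for ngram in sorted(shift):
    print("%s: %+.4f" % (ngram, shift[ngram]))

# The top-weighted year follows the phrase being embedded.
top_at_17 = max(at_17, key=lambda n: at_17[n]["score"])
top_at_21 = max(at_21, key=lambda n: at_21[n]["score"])
print(top_at_17, top_at_21)
```

Moving from 415,000 to 587,000 raises the weight of 2000 and lowers the weight of 1991, so each phrase emphasizes the year it is associated with in the sentence.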

Phrase Index

For now, please see the original DenSPI repository or the recent application of DenSPI to the COVID-19 domain for how to build a phrase index using DenSPI+Sparc. The main changes in phrase indexing are in post.py and mips_phrase.py, where Sparc is used for open-domain QA inference (see here).

Reference

@inproceedings{lee2020contextualized,
  title={Contextualized Sparse Representations for Real-Time Open-Domain Question Answering},
  author={Lee, Jinhyuk and Seo, Minjoon and Hajishirzi, Hannaneh and Kang, Jaewoo},
  booktitle={ACL},
  year={2020}
}