
OTTeR: Open Table-and-Text Retriever

Source code for our EMNLP-22 Findings paper Mixed-modality Representation Learning and Pre-training for Joint Table-and-Text Retrieval in OpenQA. We open-source a two-stage OpenQA system that first retrieves relevant table-text blocks and then extracts answers from the retrieved evidence.

Repository Structure

  • data_ottqa: this folder contains the original dataset copied from OTT-QA.
  • data_wikitable: this folder contains the crawled tables and linked passages from Wikipedia.
  • preprocessing: this folder contains the code for preparing the data used for training, validating and testing the retriever, including the code to obtain data for the ablation study in the paper.
  • retrieval: this folder contains the source code for the table-text retrieval stage.
  • qa: this folder contains the source code for the question answering stage.
  • scripts: this folder contains the .py and .sh files to run experiments.
  • preprocessed_data: this folder contains the preprocessed data produced by Step 1.
  • BLINK: this folder contains the source code adapted from https://github.com/facebookresearch/BLINK for entity linking.

Requirements

pillow==5.4.1
torch==1.8.0
transformers==4.5.0
faiss-gpu
tensorboard==1.15.0
tqdm
torch-scatter
scikit-learn
scipy
bottle
nltk
sentencepiece
pexpect
prettytable
fuzzywuzzy
dateparser
pathos

We also use apex to support mixed-precision training. You can install it with the following commands:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir ./

Usage

Step 0: Download dataset

Step 0-1: OTT-QA dataset
git clone https://github.com/wenhuchen/OTT-QA.git
cp OTT-QA/release_data/* ./data_ottqa
Step 0-2: OTT-QA all tables and passages
cd data_wikitable/
wget https://opendomainhybridqa.s3-us-west-2.amazonaws.com/all_plain_tables.json
wget https://opendomainhybridqa.s3-us-west-2.amazonaws.com/all_passages.json
cd ../

These commands download the crawled tables and linked passages from Wikipedia in a cleaned format.
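
If you want to sanity-check the downloads, here is an optional Python snippet; it only prints sizes and a sample key, so it makes no assumption about the exact schema of the OTT-QA release:

import json

# Run from the data_wikitable/ directory where the files were downloaded.
with open("all_plain_tables.json") as f:
    tables = json.load(f)
with open("all_passages.json") as f:
    passages = json.load(f)

print(len(tables), "tables;", len(passages), "passages")
print("sample table id:", next(iter(tables)))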

Retrieval Part -- OTTeR

Step 1: Preprocess

Step 1-1: Link table cells with passages using BLINK

We strongly suggest downloading our processed linked passages from all_constructed_blink_tables.json and skipping Step 1-1, since linking costs too much time. Download the all_constructed_blink_tables.json.gz file, unzip it with gunzip, and move the json file to ./data_wikitable. After that, go to Step 1-2 to preprocess. (You can also use the linked passages in all_constructed_tables.json following OTT-QA.)

If you want to link by yourself, you can run the following script:

cd scripts/
for i in {0..7}
do
  echo "Starting process ${i}"
  CUDA_VISIBLE_DEVICES=$i python link_prediction_blink.py --shard ${i}@8 --do_all --dataset ../data_wikitable/all_plain_tables.json --data_output ../data_wikitable/all_constructed_blink_tables.json 2>&1 | tee ./logs/eval$i@8.log &
done

Linking using the above script takes about 40-50 hours with 8 Tesla V100 32G GPUs. After linking, you can merge the 8 split files all_constructed_blink_tables_${i}@8.json into one json file all_constructed_blink_tables.json.
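
A minimal merge sketch, assuming each shard file is a JSON object keyed by table id (the exact output schema of link_prediction_blink.py is an assumption; adjust paths as needed):

import glob
import json

# Run from the scripts/ directory; shards were written to ../data_wikitable/.
merged = {}
for path in sorted(glob.glob("../data_wikitable/all_constructed_blink_tables_*@8.json")):
    with open(path) as f:
        merged.update(json.load(f))  # assumes each shard is a dict of table id -> linked table

with open("../data_wikitable/all_constructed_blink_tables.json", "w") as f:
    json.dump(merged, f)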

Step 1-2: Preprocess training data for retrieval
python retriever_preprocess.py --split train --nega intable_contra --aug_blink
python retriever_preprocess.py --split dev --nega intable_contra --aug_blink

These two commands create the data used for training and validation.

Step 1-3: Build retrieval corpus
python corpus_preprocess.py

This script builds the corpus of table-text blocks used for retrieval.
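
Conceptually, each table-text block pairs one linearized table row with the passages linked to its cells. The toy sketch below is illustrative only and is not the exact format produced by corpus_preprocess.py:

# Illustrative linearization of a table-text block (the real format is an assumption).
def linearize_block(title, header, row_cells, linked_passages):
    row_text = " ; ".join(f"{h} is {c}" for h, c in zip(header, row_cells))
    return f"{title} . {row_text} . " + " ".join(linked_passages)

print(linearize_block(
    "List of Governors of Ohio",
    ["Governor", "Took office"],
    ["John Kasich", "2011"],
    ["John Richard Kasich is an American politician who served as the 69th governor of Ohio."],
))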

Step 1-4: Download tbid2doc file

Download the tfidf_augmentation_results.json.gz file here, then use the following commands to unzip it and move the unzipped json file to ./data_wikitable. This file will be used for preprocessing in Step 4-3 and Step 5-1.

gunzip tfidf_augmentation_results.json.gz
mv tfidf_augmentation_results.json ./data_wikitable/

Step 2: Pre-train the OTTeR with mixed-modality synthetic pre-training

In this step, we pre-train the OTTeR with a BART-generated mixed-modality synthetic corpus. You have two choices here.

(1) Skip Step 2 and jump to Step 3. In this case, you just need to remove the argument --init_checkpoint ${PRETRAIN_MODEL_PATH}/checkpoint_best.pt from the training script in Step 3.

(2) Download the pre-trained checkpoint. The pre-trained checkpoint can be found here. Download it, use the following command to unzip it, and then move the unzipped folder to ${BASIC_PATH}/models.

unzip -d ./checkpoint-pretrain checkpoint-pretrain.zip 

Step 3: Train the OTTeR

If you don't want to use the checkpoint pre-trained with our proposed mixed-modality synthetic pre-training, remove the line --init_checkpoint ${PRETRAIN_MODEL_PATH}/checkpoint_best.pt \ from the command below and run the remaining script.

export RUN_ID=0
export BASIC_PATH=.
export DATA_PATH=${BASIC_PATH}/preprocessed_data/retrieval
export TRAIN_DATA_PATH=${BASIC_PATH}/preprocessed_data/retrieval/train_intable_contra_blink_row.pkl
export DEV_DATA_PATH=${BASIC_PATH}/preprocessed_data/retrieval/dev_intable_contra_blink_row.pkl
export RT_MODEL_PATH=${BASIC_PATH}/models/otter
export PRETRAIN_MODEL_PATH=${BASIC_PATH}/models/checkpoint-pretrain/
export TABLE_CORPUS=table_corpus_blink
mkdir ${RT_MODEL_PATH}

cd ./scripts
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_1hop_tb_retrieval.py \
  --do_train \
  --prefix ${RUN_ID} \
  --predict_batch_size 800 \
  --model_name roberta-base \
  --shared_encoder \
  --train_batch_size 64 \
  --fp16 \
  --init_checkpoint ${PRETRAIN_MODEL_PATH}/checkpoint_best.pt \
  --max_c_len 512 \
  --max_q_len 70 \
  --num_train_epochs 20 \
  --warmup_ratio 0.1 \
  --train_file ${TRAIN_DATA_PATH} \
  --predict_file ${DEV_DATA_PATH} \
  --output_dir ${RT_MODEL_PATH} \
  2>&1 |tee ./retrieval_training.log

The training step takes about 10~12 hours with 8 Tesla V100 16G GPUs.

Step 4: Evaluate retrieval performance

Step 4-1: Encode table corpus and dev. questions with OTTeR

Encode dev questions.

cd ./scripts
CUDA_VISIBLE_DEVICES="0,1,2,3" python encode_corpus.py \
    --do_predict \
    --predict_batch_size 100 \
    --model_name roberta-base \
    --shared_encoder \
    --predict_file ${BASIC_PATH}/data_ottqa/dev.json \
    --init_checkpoint ${RT_MODEL_PATH}/checkpoint_best.pt \
    --embed_save_path ${RT_MODEL_PATH}/indexed_embeddings/question_dev \
    --fp16 \
    --max_c_len 512 \
    --num_workers 8  2>&1 |tee ./encode_corpus_dev.log

Encode the table-text block corpus. Encoding takes about 3 hours.

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python encode_corpus.py \
    --do_predict \
    --encode_table \
    --shared_encoder \
    --predict_batch_size 1600 \
    --model_name roberta-base \
    --predict_file ${DATA_PATH}/${TABLE_CORPUS}.pkl \
    --init_checkpoint ${RT_MODEL_PATH}/checkpoint_best.pt \
    --embed_save_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS} \
    --fp16 \
    --max_c_len 512 \
    --num_workers 24  2>&1 |tee ./encode_corpus_table_blink.log

Step 4-2: Build index and search with FAISS

The reported results are table recalls.

python eval_ottqa_retrieval.py \
	 --raw_data_path ${BASIC_PATH}/data_ottqa/dev.json \
	 --eval_only_ans \
	 --query_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/question_dev.npy \
	 --corpus_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}.npy \
	 --id2doc_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}/id2doc.json \
     --output_save_path ${RT_MODEL_PATH}/indexed_embeddings/dev_output_k100_${TABLE_CORPUS}.json \
     --beam_size 100  2>&1 |tee ./results_retrieval_dev.log
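
Under the hood, this is dense maximum inner product search over the encoded corpus. Below is a minimal standalone sketch of the same idea with FAISS (the file locations and the exact ranking logic of eval_ottqa_retrieval.py are assumptions):

import json
import numpy as np
import faiss

# Substitute your ${RT_MODEL_PATH}/indexed_embeddings paths here.
queries = np.load("question_dev.npy").astype("float32")
corpus = np.load("table_corpus_blink.npy").astype("float32")
with open("id2doc.json") as f:
    id2doc = json.load(f)  # assumed to map corpus row index (as a string) to a table-text block

index = faiss.IndexFlatIP(corpus.shape[1])  # exact inner-product search
index.add(corpus)
scores, ids = index.search(queries, 100)    # beam_size = 100, as above

# Top-ranked table-text block for the first dev question.
print(scores[0][0], id2doc[str(ids[0][0])])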

Step 4-3: Generate retrieval output for stage-2 question answering

This step also evaluates the table block recall defined in our paper. We use the top 15 table-text blocks for QA, i.e., CONCAT_TBS=15.

export CONCAT_TBS=15
python ../preprocessing/qa_preprocess.py \
     --split dev \
     --topk_tbs ${CONCAT_TBS} \
     --retrieval_results_file ${RT_MODEL_PATH}/indexed_embeddings/dev_output_k100_${TABLE_CORPUS}.json \
     --qa_save_path ${RT_MODEL_PATH}/dev_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
     2>&1 |tee ./preprocess_qa_dev_k100cat${CONCAT_TBS}.log;

QA part -- Longformer Reader

As we mainly focus on improving retrieval accuracy in this paper, we use a state-of-the-art reader model to evaluate downstream QA performance.

Step 5: Train the QA model

Step 5-1: Create training data

As mentioned in our paper, to balance the distributions of the training and inference data, we also take the top-k table-text blocks for training, which contain several ground-truth blocks plus the remaining retrieved blocks. We use the following scripts to obtain the training data.

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python encode_corpus.py \
    --do_predict \
    --predict_batch_size 200 \
    --model_name roberta-base \
    --shared_encoder \
    --predict_file ${BASIC_PATH}/data_ottqa/train.json \
    --init_checkpoint ${RT_MODEL_PATH}/checkpoint_best.pt \
    --embed_save_path ${RT_MODEL_PATH}/indexed_embeddings/question_train \
    --fp16 \
    --max_c_len 512 \
    --num_workers 16  2>&1 |tee ./encode_corpus_train.log

python eval_ottqa_retrieval.py \
	   --raw_data_path ${BASIC_PATH}/data_ottqa/train.json \
	   --eval_only_ans \
	   --query_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/question_train.npy \
	   --corpus_embeddings_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}.npy \
	   --id2doc_path ${RT_MODEL_PATH}/indexed_embeddings/${TABLE_CORPUS}/id2doc.json \
	   --output_save_path ${RT_MODEL_PATH}/indexed_embeddings/train_output_k100_${TABLE_CORPUS}.json \
	   --beam_size 100  2>&1 |tee ./results_retrieval_train.log

python ../preprocessing/qa_preprocess.py \
	    --split train \
	    --topk_tbs 15 \
	    --retrieval_results_file ${RT_MODEL_PATH}/indexed_embeddings/train_output_k100_${TABLE_CORPUS}.json \
	    --qa_save_path ${RT_MODEL_PATH}/train_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
	    2>&1 |tee ./preprocess_qa_train_k100.log

Note that we also require the retrieval output for the dev set. You can refer to Step 4-3 to obtain the processed QA data.

Step 5-2: Train
export BASIC_PATH=.
export MODEL_NAME=mrm8488/longformer-base-4096-finetuned-squadv2
export TOPK=15
export QA_MODEL_PATH=${BASIC_PATH}/models/qa_longformer_${TOPK}_squadv2
mkdir ${QA_MODEL_PATH}

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_final_qa.py \
    --do_train \
    --do_eval \
    --model_type longformer \
    --dont_save_cache \
    --overwrite_cache \
    --model_name_or_path ${MODEL_NAME} \
    --evaluate_during_training \
    --data_dir ${RT_MODEL_PATH} \
    --output_dir ${QA_MODEL_PATH} \
    --train_file train_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
    --dev_file dev_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
    --per_gpu_train_batch_size 2 \
    --per_gpu_eval_batch_size 8 \
    --learning_rate 1e-5 \
    --num_train_epochs 4 \
    --max_seq_length 4096 \
    --doc_stride 1024 \
    --topk_tbs ${TOPK} \
    2>&1 | tee ./train_qa_longformer-base-top${TOPK}.log

Step 6: Evaluate the QA performance

export BASIC_PATH=.
export TOPK=15
export QA_MODEL_PATH=${BASIC_PATH}/models/qa_longformer_${TOPK}_squadv2

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" python train_final_qa.py \
    --do_eval \
    --model_type longformer \
    --dont_save_cache \
    --overwrite_cache \
    --model_name_or_path ${MODEL_NAME} \
    --data_dir ${RT_MODEL_PATH} \
    --output_dir ${QA_MODEL_PATH} \
    --dev_file dev_preprocessed_${TABLE_CORPUS}_k100cat${CONCAT_TBS}.json \
    --per_gpu_eval_batch_size 16 \
    --max_seq_length 4096 \
    --doc_stride 1024 \
    --topk_tbs ${TOPK} \
    2>&1 | tee ./test_qa_longformer-base-top${TOPK}.log

Reference

If you find our code useful, please cite our paper:

@inproceedings{huang-etal-2022-mixed,
    title = "Mixed-modality Representation Learning and Pre-training for Joint Table-and-Text Retrieval in {O}pen{QA}",
    author={Huang, Junjie and Zhong, Wanjun and Liu, Qian and Gong, Ming and Jiang, Daxin and Duan, Nan},
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.303",
    pages = "4117--4129",
}

You can also check out our other paper focusing on reasoning:

@inproceedings{Zhong2022ReasoningOH,
  title={Reasoning over Hybrid Chain for Table-and-Text Open Domain Question Answering},
  author={Wanjun Zhong and Junjie Huang and Qian Liu and Ming Zhou and Jiahai Wang and Jian Yin and Nan Duan},
  booktitle={IJCAI},
  year={2022}
}