Chien-Sheng (Jason) Wu, Andrea Madotto, Wenhao Liu, Pascale Fung, Caiming Xiong.
Please cite our work if you use the data or code in this repository:
```
@article{wu2021qaconv,
  title={QAConv: Question Answering on Informative Conversations},
  author={Wu, Chien-Sheng and Madotto, Andrea and Liu, Wenhao and Fung, Pascale and Xiong, Caiming},
  journal={arXiv preprint arXiv:2105.06912},
  year={2021}
}
```
This paper introduces QAConv, a new question answering (QA) dataset that uses conversations as a knowledge source. We focus on informative conversations, including business emails, panel discussions, and work channels. Unlike open-domain and task-oriented dialogues, these conversations are usually long, complex, asynchronous, and involve strong domain knowledge. In total, we collect 34,204 QA pairs, including span-based, free-form, and unanswerable questions, from 10,259 selected conversations with both human-written and machine-generated questions. We segment long conversations into chunks, and use a question generator and a dialogue summarizer as auxiliary tools to collect multi-hop questions. The dataset has two testing scenarios, chunk mode and full mode, depending on whether the grounded chunk is provided or must be retrieved from a large conversational pool. Experimental results show that state-of-the-art QA systems trained on existing QA datasets have limited zero-shot ability and tend to predict our questions as unanswerable. Fine-tuning such systems on our corpus achieves significant improvements of up to 23.6% in chunk mode and 13.6% in full mode.
Unzip the `dataset/data-v1.1-released-on-13-Mar-2022.zip` file; the files below will then appear under the `data` folder.
- Question-Answer files (`trn.json`, `val.json`, `tst.json`)
```json
[
  {
    "id": "tst-0",
    "article_segment_id": "newsdial-1068",
    "article_full_id": [
      "newsidal-NPR-170"
    ],
    "QG": false,
    "question": "Which contact number is available for callers on the line said by NEAL CONAN?",
    "answers": [
      "800-989-8255"
    ]
  }
]
```
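As a quick sanity check, here is a minimal sketch for loading and inspecting the QA files (assuming they sit under `data/` after unzipping):

```python
import json

# Load the test-split QA pairs (path assumes the unzipped "data" folder).
with open("data/tst.json") as f:
    qa_pairs = json.load(f)

qa = qa_pairs[0]
print(qa["question"])            # question text ("QG" flags machine-generated questions)
print(qa["answers"])             # list of gold answer strings
print(qa["article_segment_id"])  # key into article_segment.json (chunk mode)
print(qa["article_full_id"])     # keys into article_full.json (full mode)
```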
- Document files (`article_segment.json`, `article_full.json`)
`article_segment.json`:
```json
{
  "newsdial-1068": {
    "prev_ctx": [
      {
        "id": "newsidal-NPR-170-133",
        "speaker": "AUBREY JEWETT",
        "text": "Up till about a week ago, I was among the many who thought, OK, in the end, Romney's going to pull it out, but I'll tell you, He is in a world of trouble right now in Florida. He may hang on, but Gingrich is really surging in the polls."
      }
    ],
    "seg_dialog": [
      {
        "id": "newsidal-NPR-170-134",
        "speaker": "NEAL CONAN, HOST",
        "text": "Lucy Morgan, what do you think?"
      },
      {
        "id": "newsidal-NPR-170-135",
        "speaker": "LUCY MORGAN",
        "text": "I think Romney will pull it out. The newest poll, the better of the polls, Quinnipiac, came out this morning giving Romney a one-point advantage, within the margin of error. But I think the advantage he has is the early vote and the establishment Republicans who are behind him."
      },
      ...
    ],
    "word_count": 204
  },
  ...
}
```
`article_full.json`:
```json
{
  "newsidal-NPR-170": [
    {
      "id": "newsidal-NPR-170-0",
      "speaker": "NEAL CONAN, HOST",
      "text": "This is TALK OF THE NATION. I'm Neal Conan in Orlando. Gabby Giffords bows out of Congress, Michele Bachmann vows to return, Newt reborn in South Carolina, while Santorum struggles to stay afloat. It's Wednesday and time for a..."
    },
    {
      "id": "newsidal-NPR-170-1",
      "speaker": "RICK SANTORUM",
      "text": "These are not cogent thoughts..."
    },
    ...
  ]
}
```
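To ground a QA pair in its conversation chunk, you can join the two files on `article_segment_id`. Here is a minimal sketch (the flattening into "speaker: text" lines is illustrative; the repo's preprocessing scripts define the exact model input format):

```python
import json

with open("data/tst.json") as f:
    qa_pairs = json.load(f)
with open("data/article_segment.json") as f:
    segments = json.load(f)

qa = qa_pairs[0]
segment = segments[qa["article_segment_id"]]

# Flatten the grounded chunk (previous context + segment turns) into
# "speaker: text" lines that a QA model can consume.
turns = segment["prev_ctx"] + segment["seg_dialog"]
chunk_text = " ".join(f'{turn["speaker"]}: {turn["text"]}' for turn in turns)
print(qa["question"], "->", qa["answers"])
print(chunk_text[:200])
```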
You can load our trained QA models with the Hugging Face `transformers` library. Free-form (generative) models:
- t5-base: `Salesforce/qaconv-unifiedqa-t5-base`
- t5-large: `Salesforce/qaconv-unifiedqa-t5-large`
- t5-3B: `Salesforce/qaconv-unifiedqa-t5-3b`
You can directly run the trained model on any conversation:
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "Salesforce/qaconv-unifiedqa-t5-base"  # you can specify the model size here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def run_model(input_string, **generator_args):
    generator_args["max_length"] = 20
    generator_args["min_length"] = 1
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    res = model.generate(input_ids, **generator_args)
    return tokenizer.batch_decode(res, skip_special_tokens=True)
```
For instance, here is how you can use it to answer a question (question and conversation are separated by `</s>`):
```python
answer = run_model("Why Salesforce accquire Slack? </s> Jason: Boom! Check the news of Salesforce. Andrea: Wowm don't know why they want to accquire Slack. Jason: This will give them a unified platform for connecting employees, customers and partners. Debbie: How much did they pay? Andrea: $27.7 billion I saw.")
```
which gives `['To have a unified platform for connecting employees, customers and partners.']`.
Span-based (extractive) models:
- roberta-large: `Salesforce/qaconv-roberta-large-squad2`
- bert-large: `Salesforce/qaconv-bert-large-uncased-whole-word-masking-squad2`
You can directly run the trained model on any conversation:
```python
from transformers import pipeline

qa_pipeline = pipeline("question-answering", model="Salesforce/qaconv-roberta-large-squad2")
```
For instance, here is how you can use it to answer a question:
```python
answer = qa_pipeline(question="Why Salesforce accquire Slack?", context="Jason: Boom! Check the news of Salesforce. Andrea: Wowm don't know why they want to accquire Slack. Jason: This will give them a unified platform for connecting employees, customers and partners. Debbie: How much did they pay? Andrea: $27.7 billion I saw.", handle_impossible_answer=True)
```
which gives `{'score': 0.33785736560821533, 'start': 127, 'end': 194, 'answer': 'a unified platform for connecting employees, customers and partners'}`.
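Note that `handle_impossible_answer=True` allows the pipeline to return an empty answer string when the model predicts the question cannot be answered from the context, which matters here because QAConv contains unanswerable questions.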
First, install the requirements with `pip install -r requirements.txt`. If you encounter an error while installing `fairscale` with the message `AttributeError: type object 'Callable' has no attribute '_abc_registry'`, try `pip uninstall typing` and then redo the installation.
- Run BM25 (./retriever)
```
❱❱❱ cd retriever
❱❱❱ ./run_retriver.sh tst
```
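Conceptually, this step scores every chunk in the conversational pool against the question and hands the best match to the QA model (full mode). Here is a minimal sketch using the `rank_bm25` package, which is an assumption for illustration; `run_retriver.sh` may use a different implementation:

```python
from rank_bm25 import BM25Okapi

# Toy conversational pool: one string per conversation chunk.
chunks = [
    "NEAL CONAN, HOST: Lucy Morgan, what do you think?",
    "LUCY MORGAN: I think Romney will pull it out. The newest poll gives Romney a one-point advantage.",
]
bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])

# Score every chunk against the question and keep the top-1 result.
question = "What does Lucy Morgan think about Romney?"
scores = bm25.get_scores(question.lower().split())
best = max(range(len(chunks)), key=lambda i: scores[i])
print(chunks[best])  # the chunk a full-mode QA model would read
```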
- DPR-wiki: We release the retrieved top-1 results at `./retriever/output_retriever_rank_dpr-wiki.json`. Please check the DPR repository for details.
- Preprocess (./data)
```
❱❱❱ python convert_txt.py
```
- Zero-shot (./baseline/free_form/)
```
❱❱❱ ./run_zs.sh
```
- Training (./baseline/free_form/finetuning/)
```
❱❱❱ ./run_finetune.sh 0,1 2 allenai/unifiedqa-t5-base 8
```
- Inference (./baseline/free_form/finetuning/)
```
❱❱❱ ./run_eval.sh 0 ../../../data/nmt/ ../../../data/ output/qaconv-allenai/unifiedqa-t5-base/ unifiedqa-t5-base output/qaconv-allenai/unifiedqa-t5-base/prediction/
❱❱❱ ./run_eval.sh 0 ../../../data/nmt-bm25/ ../../../data/ output/qaconv-allenai/unifiedqa-t5-base/ unifiedqa-t5-base-bm25 output/qaconv-allenai/unifiedqa-t5-base/prediction-bm25/
❱❱❱ ./run_eval.sh 0 ../../../data/nmt-dpr/ ../../../data/ output/qaconv-allenai/unifiedqa-t5-base/ unifiedqa-t5-base-dprwiki output/qaconv-allenai/unifiedqa-t5-base/prediction-dprwiki/
```
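Judging from the directory names, these three inference runs evaluate the same fine-tuned model on ground-truth chunks (`nmt/`), BM25-retrieved chunks (`nmt-bm25/`), and DPR-wiki-retrieved chunks (`nmt-dpr/`), i.e., chunk mode versus the two full-mode retrieval settings.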
- Preprocess (./baseline/span_based)
```
❱❱❱ cd ./baseline/span_based
❱❱❱ python preproc.py
```
- Training (./baseline/span_based)
```
❱❱❱ ./run_qa.sh
```
- Inference (./baseline/span_based)
```
❱❱❱ python test_pipe.py --gpu 0
```
- Evaluate one single prediction file (./)
```
❱❱❱ python evaluate.py data/tst.json prediction/unifiedqa-t5-base-zeroshot.json
```
- Evaluate the whole folder with all the prediction files (./)
```
❱❱❱ python evaluate.py data/tst.json prediction/ --folder
```
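If you want to evaluate your own model, `evaluate.py` defines the expected prediction format. As a rough sketch, a SQuAD-style JSON mapping from question id to predicted answer string (an assumption; inspect `evaluate.py` for the exact schema) could be produced like this:

```python
import json

# Hypothetical prediction file: question id -> predicted answer string.
# (Assumed SQuAD-style layout; check evaluate.py for the exact schema.)
predictions = {"tst-0": "800-989-8255"}
with open("prediction/my-model.json", "w") as f:
    json.dump(predictions, f)
```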
We have used only publicly available transcript data and adhere to the source guidelines; for example, the Media data is for research purposes only and cannot be used commercially. As conversations may contain biased views, for example, specific political opinions from speakers, the transcripts and QA pairs will likely contain them as well. The content of the transcripts and summaries reflects only the views of the speakers, not the authors' points of view. We would like to remind our dataset users that there could be potential bias, toxicity, and subjective opinions in the selected conversations, which may impact model training. Please view the content and use the data with discretion.
If you submit papers on QAConv, please consider sending a pull request to merge your results onto the leaderboard. By submitting, you acknowledge that your results are obtained without training on the val/test splits, and that your models are tuned on the val split, not the test split.
P.S. There are two versions of the data (V1.0 and V1.1). The results shown below are from V1.1.
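Here, EM and F1 are the standard SQuAD-style exact-match and token-level F1 scores, and FZ-R is a fuzzy-matching ratio. As a rough illustration of FZ-R (our reading of the metric; `evaluate.py` is the authoritative implementation), using the `fuzzywuzzy` package:

```python
from fuzzywuzzy import fuzz

# Fuzzy-match ratio between a prediction and a gold answer, on a 0-100 scale.
print(fuzz.ratio("800-989-8255", "800-989-8255"))                # 100: exact match
print(fuzz.ratio("a unified platform", "the unified platform"))  # partial credit
```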
- Zero-Shot (chunk mode)

Model | EM | F1 | FZ-R
---|---|---|---
Human Performance | 79.99 | 89.87 | 92.33
T5-3B (UnifiedQA) | 59.93 | 73.07 | 78.89
T5-Large (UnifiedQA) | 58.81 | 71.67 | 77.72
BERT-Large (SQuAD 2.0) | 53.54 | 62.58 | 71.11
T5-Base (UnifiedQA) | 51.95 | 65.48 | 73.26
RoBERTa-Large (SQuAD 2.0) | 50.78 | 59.73 | 69.11
RoBERTa-Base (SQuAD 2.0) | 48.92 | 57.33 | 67.40
T5-11B (UnifiedQA) | 44.96 | 61.52 | 68.68
DistilBERT-Base (SQuAD 2.0) | 40.04 | 46.90 | 59.62
BERT-Base (SQuAD 2.0) | 36.22 | 44.57 | 57.72
- Fine-Tune (chunk mode)

Model | EM | F1 | FZ-R
---|---|---|---
RoBERTa-Large (SQuAD 2.0) | 67.80 | 78.80 | 83.10
T5-3B (UnifiedQA) | 67.41 | 79.41 | 83.64
T5-Large (UnifiedQA) | 66.76 | 78.67 | 83.21
T5-Base (UnifiedQA) | 64.98 | 76.52 | 81.69
BERT-Large (SQuAD 2.0) | 64.93 | 76.65 | 81.27
RoBERTa-Base (SQuAD 2.0) | 63.64 | 75.53 | 80.38
BERT-Base (SQuAD 2.0) | 58.84 | 71.02 | 77.03
DistilBERT-Base (SQuAD 2.0) | 57.28 | 68.88 | 75.39
- Zero-Shot (full mode)

Model | EM | F1 | FZ-R
---|---|---|---
BM25 + T5-3B (UnifiedQA) | 41.76 | 52.68 | 63.54
BM25 + T5-Large (UnifiedQA) | 40.62 | 50.87 | 62.10
BM25 + BERT-Large (SQuAD 2.0) | 37.09 | 43.44 | 57.21
BM25 + T5-Base (UnifiedQA) | 36.47 | 47.11 | 59.22
BM25 + RoBERTa-Large (SQuAD 2.0) | 35.54 | 41.50 | 55.79
BM25 + RoBERTa-Base (SQuAD 2.0) | 34.61 | 40.74 | 55.37
BM25 + DistilBERT-Base (SQuAD 2.0) | 29.36 | 34.09 | 50.35
BM25 + BERT-Base (SQuAD 2.0) | 25.84 | 31.52 | 48.28
- Fine-Tune (full mode)

Model | EM | F1 | FZ-R
---|---|---|---
BM25 + T5-3B (UnifiedQA) | 45.86 | 55.17 | 65.76
BM25 + RoBERTa-Large (SQuAD 2.0) | 45.59 | 54.42 | 65.23
BM25 + T5-Large (UnifiedQA) | 45.34 | 54.49 | 65.47
BM25 + BERT-Large (SQuAD 2.0) | 44.50 | 53.48 | 64.21
BM25 + T5-Base (UnifiedQA) | 43.95 | 52.96 | 64.22
BM25 + RoBERTa-Base (SQuAD 2.0) | 43.18 | 52.64 | 63.62
BM25 + BERT-Base (SQuAD 2.0) | 40.02 | 49.39 | 61.13
BM25 + DistilBERT-Base (SQuAD 2.0) | 39.39 | 48.38 | 60.46
Please create an issue or send an email to wu.jason@salesforce.com for any questions, bug reports, etc.