/FriendsQA

Question answering on multiparty dialogue

Primary LanguagePythonOtherNOASSERTION

Question Answering on Multiparty Dialogue

Question Answering challenges the machine's ability of answering to queries in different forms. This is a part of the Character Mining project led by the Emory NLP research group. The following shows a multiparty dialogue between Joey and Chandler with 6 questions regarding the contents of the dialogue. Questions could be in 6 forms (what, who, when, where, why, how)

Speaker Utterance
U01 [Scene: Central Perk, Joey is getting a phone number from a woman (Casey) as Chandler watches from the doorway.]
U02 Casey: Here you go.
U03 Joey: Great! All right, so I’ll call you later.
U04 Casey: Great!
U05 Chandler: Hey-Hey-Hey! Who was that?
U06 Joey: That would be Casey. We’re going out tonight.
U07 Chandler: Goin’ out, huh? Wow! Wow! So things didn’t work out with Kathy, huh? Bummer.
U08 Joey: No, things are fine with Kathy. I’m having a late dinner with her tonight, right after my early dinner with Casey.
U09 Chandler: What?
U10 Joey: Yeah-yeah. And the craziest thing is that I just ate a whole pizza by myself!
U11 Chandler: Wait! You’re going out with Kathy!
U12 Joey: Yeah. Why are you getting so upset?
U13 Chandler: Well, I’m upset for you. I mean, dating an endless line of beautiful women must be very unfulfilling for you.
  • Q1: What is Joey going to do with Casey tonight?
  • Q2: Who is Joey getting a phone number from?
  • Q3: When will Joey have dinner with Kathy?
  • Q4: Where are Joey and Chandler?
  • Q5: Why is Chandler upset?
  • Q6: How are things between Joey and Kathy?

Your task is to answering these open-domain questions using contiguous spans from the dialogues. This task is challenging because questions could be in any form and might not contain the exact words from the document.

Dataset

For the generation of the FriendsQA dataset, 1,222 scenes from the first four seasons of the Character Mining dataset are selected. Scenes with fewer than five utterances are discarded (83 of them), and each scene is considered an independent dialogue. FriendQA can be viewed as answer span selection, where questions are asked for some contexts in a dialogue and the model is expected to find certain spans in the dialogue containing answer contents. The dialogue aspects of this dataset, however, make it more challenging than other datasets comprising passages in formal languages. Details could be found in the paper.

  • Latest release: v2.0

Statistics

The data split is based on chronological order of the episodes that is consistent across other Character Mining projects:

Dataset Dialogues Questions Answers Eposides
TRN 973 9,791 16,352 1 - 20
DEV 113 1,189 2,065 21 - 22
TST 136 1,172 1,920 23 - *

Annotation

The format of the data separates the context into several utterances and separates the speakers and utterance for each utterance.

"utterances:": [
                        {
                            "uid": 0,
                            "speakers": [
                                "Ross Geller"
                            ],
                            "utterance": "Breathe ."
                        },
                        {
                            "uid": 1,
                            "speakers": [
                                "Susan Bunch"
                            ],
                            "utterance": "Breathe ."
                        },
                        {
                            "uid": 2,
                            "speakers": [
                                "Carol Willick"
                            ],
                            "utterance": "You 're gon na kill me !"
                        }
  ]

The "qas" field includes questions and the answers. An answer has five keys. The first is "answer_text" which denotes the original text. The second is "utterance_id" which denotes the answer appearing in which utterance. The "inner_start" and "inner_end" denote the answer start and end token position in the corresponding utterance and if answer is the speaker, their values are -1 and the "is_speaker" is set as true .

"qas": [
   {
        "id": "s01_e23_c06_What",
        "question": "What does Ross want to name his son ?",
        "answers": [
        {
            "answer_text": "Jamie",
            "utterance_id": 12,
            "inner_start": 24,
            "inner_end": 24,
            "is_speaker": false
         },
         {
            "answer_text": "Jordie .",
            "utterance_id": 9,
            "inner_start": 21,
            "inner_end": 22,
            "is_speaker": false
         }
         ]
   },
  {
         "id": "s01_e23_c06_Who_Paraphrased",
         "question": "By whom was Ross told to count faster ?",
         "answers": [
         {
             "answer_text": "Carol Willick",
             "utterance_id": 6,
             "inner_start": -1,
             "inner_end": -1,
             "is_speaker": true
         }
         ]
  }
  ]

Citation

Contact