speaker "ids" seem to be non-unique
AnnaWegmann commented
Hi there,
Looking at the MediaSum dataset, the identifiers for the speakers in the "speaker" list seem to be non-unique. E.g., NPR-7 has the speakers ['PROFITT', 'STEVE PROFFITT', 'Ms. SUSAN STURGILL', 'MADELEINE BRAND, host', 'Ms. BARBARA LEBEY', 'Mr. RANDY HALL'], where I understand 'PROFITT' and 'STEVE PROFFITT' should refer to the same person. I think this probably happens quite a few times: filtering naively for "2-person" dialogs yields only 22,020 interviews (roughly 14,500 of them NPR interviews), although this dataset is meant to encompass the NPR-based INTERVIEW dataset, which includes 23,714 2-person dialogs.
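To make the mismatch concrete, here is a rough sketch of the kind of merging that seems to be needed. It is not anything provided with the dataset: the helper names `speaker_key` and `merge_speakers`, the title list, and the 0.85 similarity threshold are all assumptions made for illustration. It reduces each speaker string to a surname guess and groups strings whose surnames are nearly identical, using `difflib.SequenceMatcher` from the standard library.

```python
import re
from difflib import SequenceMatcher

# Honorifics to ignore when guessing the surname (assumed list, surely incomplete).
TITLES = {"MR", "MS", "MRS", "DR", "PROF", "SEN", "REP"}


def speaker_key(name: str) -> str:
    """Reduce a raw speaker string to an upper-cased surname guess."""
    name = name.split(",")[0]  # drop a role suffix such as ", host"
    tokens = re.findall(r"[A-Z'-]+", name.upper())
    tokens = [t for t in tokens if t not in TITLES]
    return tokens[-1] if tokens else name.strip().upper()


def merge_speakers(speakers, threshold=0.85):
    """Greedily group speaker strings whose surname keys are near-identical.

    The threshold is an arbitrary assumption, not a tuned value.
    """
    clusters = []  # list of (representative key, [raw speaker strings])
    for raw in speakers:
        key = speaker_key(raw)
        for rep, members in clusters:
            if SequenceMatcher(None, key, rep).ratio() >= threshold:
                members.append(raw)
                break
        else:
            clusters.append((key, [raw]))
    return [members for _, members in clusters]


# Speaker strings reported for NPR-7.
speakers = ['PROFITT', 'STEVE PROFFITT', 'Ms. SUSAN STURGILL',
            'MADELEINE BRAND, host', 'Ms. BARBARA LEBEY', 'Mr. RANDY HALL']

naive_count = len(set(speakers))             # what a naive "2-person" filter counts
merged_count = len(merge_speakers(speakers))  # count after fuzzy surname merging
print(naive_count, merged_count)
```

On the NPR-7 list this groups 'PROFITT' with 'STEVE PROFFITT', so six distinct strings become five speakers. The heuristic is crude: two different people sharing a surname would be merged incorrectly, and a speaker identified only by first name would not be matched at all.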
Just wondering if you/anyone encountered this and developed a workaround?