/AMICorpusXML

Extract meeting transcripts and summaries from the AMI Corpus

Primary language: Python. License: MIT.

About

  • Extracts meeting transcripts and summaries from the AMI Corpus
  • Transforms them into the CNN-DailyMail News dataset format (.story files, each containing an article and its highlights)
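
As a rough illustration of the target format, the sketch below writes one transcript and its summary sentences in the usual CNN-DailyMail .story layout, where the article comes first and each highlight follows an `@highlight` marker line. The function name `to_story` and the sample strings are hypothetical and are not taken from this repository's code.

```python
def to_story(article: str, highlights: list[str]) -> str:
    """Join an article and its highlights in the CNN-DailyMail .story layout.

    Hypothetical helper: the repository's own conversion code may differ.
    """
    parts = [article.strip()]
    for highlight in highlights:
        # Each highlight is preceded by a marker line, with blank lines around it.
        parts.append("@highlight")
        parts.append(highlight.strip())
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    # Made-up example text, for illustration only.
    print(to_story("We should pick a remote design.", ["The team discussed the design."]))
```

Here the meeting transcript plays the role of the article and the abstractive summary plays the role of the highlights.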

Contents

  • AMI Corpus
  • Story Dataset
  • How to Make

AMI Corpus

  • Number of meetings (including scenario and non-scenario): 171
    • Number of speakers per meeting: 4-5
    • Total number of transcripts: 687
  • Number of summaries: 142
    • Abstract info is only available for meetings with names starting with ES, IS and TS
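
The prefix rule above can be sketched as a simple filter. This is a minimal sketch, assuming meeting names are plain strings; the helper name `may_have_abstract` and the sample names are illustrative. Since only 142 summaries exist, a matching prefix should be read as a necessary condition rather than a guarantee.

```python
# Prefixes of meeting names for which abstract info can exist, per the list above.
ABSTRACT_PREFIXES = ("ES", "IS", "TS")


def may_have_abstract(meeting_name: str) -> bool:
    """Return True if the meeting name starts with ES, IS or TS.

    Hypothetical helper: a True result does not guarantee that a summary
    file is actually present for that meeting.
    """
    return meeting_name.startswith(ABSTRACT_PREFIXES)


if __name__ == "__main__":
    # Sample names for illustration only.
    names = ["ES2002a", "IS1000a", "TS3003a", "EN2001a"]
    print([name for name in names if may_have_abstract(name)])
```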

Story Dataset

A ready-made .story dataset is provided under data/ami-transcripts-stories/
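
To read one of these files back, something like the following may work. It is a sketch that assumes the files follow the standard CNN-DailyMail layout (article text, then `@highlight` marker lines each followed by one highlight); the function name `parse_story` is hypothetical.

```python
def parse_story(text: str) -> tuple[str, list[str]]:
    """Split the text of a .story file into (article, highlights).

    Assumes the '@highlight' marker convention; not taken from this repository.
    """
    blocks = [block.strip() for block in text.split("@highlight")]
    article = blocks[0]
    # Skip empty blocks, e.g. from a trailing marker with no text after it.
    highlights = [block for block in blocks[1:] if block]
    return article, highlights


if __name__ == "__main__":
    sample = "Transcript text.\n\n@highlight\n\nFirst point.\n\n@highlight\n\nSecond point.\n"
    print(parse_story(sample))
```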

How to Make

To make the dataset from scratch (download the AMI Corpus and extract .story files), run:

    python main_extract_meeting_text.py

Configuration options

    Argument                          Type     Default
    ami_xml_dir                       string   "data/"
    results_transcripts_speaker_dir   string   "data/ami-transcripts-speaker/"
    results_transcripts_dir           string   "data/ami-transcript/"
    results_summary_dir               string   "data/ami-summary/"

  • ami_xml_dir is the directory where the AMI Corpus will be downloaded
  • results_transcripts_speaker_dir is the directory where each speaker's transcript will be saved
  • results_transcripts_dir is the directory where each meeting's transcript will be saved
  • results_summary_dir is the directory where each meeting's summary will be saved
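
The options above could be declared roughly as follows. This is a sketch only: the README does not say how main_extract_meeting_text.py reads its arguments, so the use of `argparse` and of `--name` style flags is an assumption. The argument names and defaults are copied from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (assumed CLI shape)."""
    parser = argparse.ArgumentParser(
        description="Extract meeting transcripts and summaries from the AMI Corpus"
    )
    # Names and defaults come from the table above; the "--" flag style is assumed.
    parser.add_argument("--ami_xml_dir", type=str, default="data/")
    parser.add_argument("--results_transcripts_speaker_dir", type=str,
                        default="data/ami-transcripts-speaker/")
    parser.add_argument("--results_transcripts_dir", type=str,
                        default="data/ami-transcript/")
    parser.add_argument("--results_summary_dir", type=str,
                        default="data/ami-summary/")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

Under that assumption, a run that overrides one default would look like `python main_extract_meeting_text.py --ami_xml_dir my_data/`.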

AMI Corpus final output structure

    assets
    +-- ami-summary 
    +-- ami-transcripts-speaker
    +-- ami-transcripts-speaker-stories
    +-- ami-transcripts-stories
    +-- ami_public_manual_1.6.2
    |   +-- abstractive
    |   ...
    |   +-- words
    |   ...

Credit/Requirements

TODO

  • Overlapping meeting transcript
  • Decision abstract