vidchatsum

License: GNU General Public License v3.0 (GPL-3.0)

Multimodal Annotation for Virtual Meeting Summarization

In this repository we provide the annotations for the paper “Multimodal Annotation for Virtual Meeting Summarization”.

The data folder contains the train, test and val (*.json) files with the annotations. The video_id in the json files can be mapped to the original dataset videos; these mappings are provided in the video_id_mappings folder. Due to copyright issues, we are unable to provide the original video data; however, you can contact the authors of the Candor dataset to download it, in line with their terms of agreement.

Data Annotations


├── data
│   ├── test.json
│   ├── train.json
│   └── val.json
└── video_id_mappings
    ├── video_id_mappings_test.txt
    ├── video_id_mappings_train.txt
    └── video_id_mappings_val.txt
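
Below is a minimal sketch of loading one split and joining it with its id mappings. The JSON layout and the mapping-file delimiter are assumptions; inspect the released files for the exact structure.

```python
# Hypothetical loading sketch; file formats are assumptions, not a spec.
import json

with open("data/val.json") as f:
    annotations = json.load(f)

# Assume each non-empty mapping line pairs an annotation video_id with the
# original Candor video id, separated by whitespace (delimiter may differ).
id_map = {}
with open("video_id_mappings/video_id_mappings_val.txt") as f:
    for line in f:
        if line.strip():
            anno_id, candor_id = line.split()[:2]
            id_map[anno_id] = candor_id

print(len(annotations), "annotated items;", len(id_map), "id mappings")
```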

Feature Extraction

Linguistics:

Process the ASR time-stamped words to generate sentence segments; a segment boundary is placed wherever the gap between consecutive words exceeds 200 ms. Tokenize the segments with the RoBERTa tokenizer. To create the entity labels, use spaCy's named entity recognition.
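
The sketch below illustrates this pipeline, assuming ASR output as (word, start, end) tuples; the input format, the roberta-base checkpoint and the en_core_web_sm spaCy model are illustrative assumptions rather than the exact released configuration.

```python
# Hedged sketch of the linguistic pre-processing described above.
from transformers import RobertaTokenizerFast
import spacy

GAP_THRESHOLD = 0.2  # 200 ms pause marks a sentence-segment boundary

def segment_words(words):
    """Group time-stamped (word, start, end) tuples into segments split at >200 ms pauses."""
    segments, current = [], []
    for i, (word, start, end) in enumerate(words):
        if current and start - words[i - 1][2] > GAP_THRESHOLD:
            segments.append(" ".join(w for w, _, _ in current))
            current = []
        current.append((word, start, end))
    if current:
        segments.append(" ".join(w for w, _, _ in current))
    return segments

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
nlp = spacy.load("en_core_web_sm")  # spaCy model choice is an assumption

asr_words = [("hello", 0.00, 0.35), ("everyone", 0.40, 0.90),
             ("John", 1.40, 1.75), ("joined", 1.80, 2.10)]

for seg in segment_words(asr_words):
    token_ids = tokenizer(seg)["input_ids"]                        # RoBERTa sub-word ids
    entities = [(ent.text, ent.label_) for ent in nlp(seg).ents]   # entity labels
    print(seg, token_ids, entities)
```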

Aural:

HuBERT is used to extract the acoustic word embeddings for the audio features. Pitch variance is also extracted using the torchcrepe library.
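
A hedged sketch of this step is shown below: HuBERT frame embeddings plus a pitch track from torchcrepe. The checkpoint, file path, hop length and the variance statistic are illustrative assumptions, not the authors' exact setup.

```python
# Sketch of aural feature extraction; parameters are assumptions.
import torch
import torchaudio
import torchcrepe
from transformers import HubertModel

waveform, sr = torchaudio.load("meeting_clip.wav")            # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0, keepdim=True)                 # mono, shape (1, T)

# HuBERT acoustic embeddings (one 768-d vector per ~20 ms frame)
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
with torch.no_grad():
    hidden = hubert(waveform).last_hidden_state               # (1, frames, 768)

# Pitch track with torchcrepe, then a simple variance statistic
pitch = torchcrepe.predict(
    waveform, 16000, hop_length=160, fmin=50.0, fmax=550.0,
    model="tiny", device="cpu")
pitch_variance = pitch.var()

print(hidden.shape, pitch_variance.item())
```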

Visual:

Video features are extracted using a pre-trained CLIP model; this repository is recommended for use. To extract hand gesture information, MediaPipe can be used.
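
A rough sketch of the visual pipeline follows: frame-level CLIP embeddings plus MediaPipe hand landmarks. The CLIP checkpoint, the use of the Hugging Face CLIP interface instead of the repository linked above, and per-frame processing are assumptions.

```python
# Sketch of visual feature extraction; model choices are assumptions.
import cv2
import mediapipe as mp
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)

cap = cv2.VideoCapture("meeting_clip.mp4")                    # placeholder path
frame_features, hand_landmarks = [], []
while True:
    ok, frame_bgr = cap.read()
    if not ok:
        break
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)

    # CLIP image embedding for this frame
    inputs = clip_processor(images=Image.fromarray(frame_rgb), return_tensors="pt")
    with torch.no_grad():
        frame_features.append(clip_model.get_image_features(**inputs))

    # MediaPipe hand landmarks (None when no hands are detected)
    hand_landmarks.append(hands.process(frame_rgb).multi_hand_landmarks)

cap.release()
print(len(frame_features), "frames processed")
```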