This repository provides the annotations for the paper “Multimodal Annotation for Virtual Meeting Summarization”. The data folder contains the train, test and val annotation files (*.json). The video_id field in the JSON files can be mapped to the videos of the original dataset; these mappings are provided in the video_id_mappings folder. Due to copyright restrictions we cannot redistribute the original video data; however, you can contact the authors of the Candor dataset to download it, in line with their terms and conditions.
├── data
│ ├── test.json
│ ├── train.json
│ └── val.json
└── video_id_mappings
├── video_id_mappings_test.txt
├── video_id_mappings_train.txt
└── video_id_mappings_val.txt
Linguistics:
Process the ASR time-stamped words to generate sentence segments: a new segment starts whenever the gap between consecutive words exceeds 200 ms. Tokenize the segments with the RoBERTa tokenizer, and create the entity labels with spaCy’s named-entity recognizer.
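The gap-based segmentation above can be sketched as follows. This is a minimal illustration, not the paper's exact code; the `(word, start_s, end_s)` input format is an assumption and should be adapted to the actual annotation schema:

```python
# Sketch: split an ASR word stream into sentence segments using the
# 200 ms gap rule. Input word format (word, start_s, end_s) is an
# assumption, not the repository's actual schema.

def segment_words(words, gap_s=0.2):
    """Group time-stamped words into segments, splitting at pauses >= gap_s."""
    segments, current = [], []
    for word, start, end in words:
        # Start a new segment when the silence before this word is long enough.
        if current and start - current[-1][2] >= gap_s:
            segments.append(current)
            current = []
        current.append((word, start, end))
    if current:
        segments.append(current)
    return [" ".join(w for w, _, _ in seg) for seg in segments]

asr = [("hello", 0.0, 0.3), ("there", 0.35, 0.6),
       ("how", 0.9, 1.1), ("are", 1.15, 1.3), ("you", 1.35, 1.6)]
print(segment_words(asr))  # ['hello there', 'how are you']
```

The resulting segment strings can then be passed to the RoBERTa tokenizer (e.g. `RobertaTokenizer` from Hugging Face transformers) and to a spaCy pipeline for entity labels.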
Aural:
Acoustic word embeddings are extracted with HuBERT, and pitch variance is extracted with the TorchCrepe library.
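One common way to turn HuBERT's frame-level features into word-level embeddings is to mean-pool the frames falling inside each word's time span. The sketch below assumes HuBERT's 20 ms frame stride (50 frames/s at 16 kHz) and a precomputed `(T, D)` feature matrix (e.g. `last_hidden_state` from transformers' `HubertModel`); it is an illustration, not the paper's stated method:

```python
import numpy as np

# Sketch: pool frame-level HuBERT features into acoustic word embeddings
# by averaging the frames within each word's time span.
# Assumption: 20 ms frame stride, as in HuBERT base at 16 kHz input.

def word_embeddings(frames, word_times, frame_stride_s=0.02):
    """frames: (T, D) feature array; word_times: list of (start_s, end_s)."""
    embs = []
    for start, end in word_times:
        i0 = int(start / frame_stride_s)
        i1 = max(i0 + 1, int(end / frame_stride_s))  # at least one frame
        embs.append(frames[i0:i1].mean(axis=0))
    return np.stack(embs)

frames = np.random.rand(100, 768)  # 2 s of placeholder HuBERT features
print(word_embeddings(frames, [(0.0, 0.3), (0.4, 1.0)]).shape)  # (2, 768)
```

Pitch can be computed separately on the raw 16 kHz waveform with torchcrepe, then reduced to per-word variance over the same time spans.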
Visual:
Video features are extracted using a pre-trained CLIP model; this repository is recommended for use. Hand gesture information can be extracted with MediaPipe.
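Before feeding frames to the CLIP image encoder, videos are typically subsampled at a fixed rate. The helper below is a hedged sketch of that step; the 1 fps rate is an assumption, not the paper's stated setting:

```python
# Sketch: pick evenly spaced frame indices to pass to a pre-trained CLIP
# image encoder. The 1 fps sampling rate is an assumption.

def sample_frame_indices(n_frames, video_fps, sample_fps=1.0):
    """Return the indices of frames sampled at sample_fps from a video."""
    step = max(1, round(video_fps / sample_fps))
    return list(range(0, n_frames, step))

# A 10 s clip at 30 fps, sampled at 1 fps -> 10 frame indices.
print(sample_frame_indices(300, 30.0))  # [0, 30, 60, ..., 270]
```

The selected frames can then be decoded (e.g. with OpenCV or decord) and batched through the CLIP image encoder; MediaPipe runs on the same decoded frames to obtain hand landmarks.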