Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

This repository contains the official PyTorch implementation of the following paper:

Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations (CVPR 2024 Oral)
Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, James M. Rehg
Paper: https://arxiv.org/abs/2403.02090

Abstract

Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions.

Preparation

Requirements

  • Python 3
  • PyTorch 2.0+
  • transformers
  • numpy
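
The dependencies above can be installed with pip; versions are not pinned here, so the command below is only a starting point (choose a PyTorch build that matches your CUDA setup):

# Example install of the core dependencies (unpinned; adjust to your environment)
pip install "torch>=2.0" transformers numpy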

Datasets

  • Download the benchmark datasets (YouTube, Ego4D) from [link].
  • For access to the original base datasets, including videos, visit [link].
  • You can download the aligned player keypoint samples from [link].

Training

train.py saves model weights to the directory given by --checkpoint_save_dir and prints the training logs.

To train the model, run one of the following commands, depending on the target task:

# Training example for speaking target identification
python train.py \
--task 'STI' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_save_dir './checkpoints' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16 --learning_rate 0.000005 \
--epochs 200 --epochs_warmup 10

# Training example for pronoun coreference resolution
python train.py \
--task 'PCR' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_save_dir './checkpoints' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16 --learning_rate 0.000005 \
--epochs 200 --epochs_warmup 10

# Training example for mentioned player prediction
python train.py \
--task 'MPP' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_save_dir './checkpoints' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16 --learning_rate 0.000005 \
--epochs 200 --epochs_warmup 10

Descriptions of training parameters are as follows:

  • --task: target task (STI or PCR or MPP)
  • --txt_dir: directory of anonymized transcripts
  • --txt_labeled_dir: directory of labeled anonymized transcripts
  • --keypoint_dir: directory of keypoints
  • --meta_dir: directory of game meta data
  • --data_split_file: file path for data split
  • --checkpoint_save_dir: directory for saving checkpoints
  • --language_model: language model (bert or roberta or electra)
  • --max_people_num: maximum number of players
  • --context_length: size of conversation context
  • --batch_size: mini-batch size
  • --learning_rate: learning rate
  • --epochs: number of total epochs
  • --epochs_warmup: number of visual warmup epochs
  • Refer to train.py for more details

Testing

test.py evaluates the performance of a trained model.

To test the model, run the following command:

# Testing example
python test.py \
--task 'STI' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_file 'enter_the_path' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16

Descriptions of testing parameters are as follows:

  • --task: target task (STI or PCR or MPP)
  • --txt_dir: directory of anonymized transcripts
  • --txt_labeled_dir: directory of labeled anonymized transcripts
  • --keypoint_dir: directory of keypoints
  • --meta_dir: directory of game meta data
  • --data_split_file: file path for data split
  • --checkpoint_file: file path for loading checkpoint
  • --language_model: language model (bert or roberta or electra)
  • --max_people_num: maximum number of players
  • --context_length: size of conversation context
  • --batch_size: mini-batch size
  • Refer to test.py for more details

Pretrained Models

You can download the pretrained models listed in the table below.

Dataset   Target Task                       Pretrained Models
YouTube   Speaking Target Identification    Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA
YouTube   Pronoun Coreference Resolution    Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA
YouTube   Mentioned Player Prediction       Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA
Ego4D     Speaking Target Identification    Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA
Ego4D     Pronoun Coreference Resolution    Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA
Ego4D     Mentioned Player Prediction       Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA
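
After downloading a checkpoint, pass its path to test.py via --checkpoint_file; the --language_model flag should presumably match the backbone of the chosen checkpoint. The checkpoint file name below is only a placeholder, not an actual file name from the release:

# Example: evaluate a downloaded YouTube STI checkpoint (file name is a placeholder)
python test.py \
--task 'STI' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_file './checkpoints/youtube_sti_baseline_bert.pth' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16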

Citation

If you find this work useful in your research, please cite our paper:

@inproceedings{lee2024modeling,
  title={Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations},
  author={Lee, Sangmin and Lai, Bolin and Ryan, Fiona and Boote, Bikram and Rehg, James M},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}

If you use or reference the base datasets, please also cite the following paper:

@inproceedings{lai2023werewolf,
  title={Werewolf Among Us: Multimodal Resources for Modeling Persuasion Behaviors in Social Deduction Games},
  author={Lai, Bolin and Zhang, Hongxin and Liu, Miao and Pariani, Aryan and Ryan, Fiona and Jia, Wenqi
          and Hayati, Shirley Anugrah and Rehg, James and Yang, Diyi},
  booktitle={Findings of the Association for Computational Linguistics},
  pages={6570--6588},
  year={2023}
}