Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

This repository contains the official PyTorch implementation of the following paper:

Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations (CVPR 2024 Oral)
Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, James M. Rehg
Paper: https://arxiv.org/abs/2403.02090

Abstract

Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions.

Preparation

Requirements

  • Python 3
  • PyTorch 2.0+
  • transformers
  • numpy
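
The dependencies above can be installed with pip; versions are not pinned here, so the command below is only a starting point (choose a PyTorch build that matches your CUDA setup):

# Example install of the core dependencies (unpinned; adjust to your environment)
pip install "torch>=2.0" transformers numpy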

Datasets

  • Download the benchmark datasets (YouTube, Ego4D) from [link].
  • For access to the original base datasets, including videos, visit [link].
  • You can download the aligned player keypoint samples from [link].

Training

train.py saves model weights to the directory given by --checkpoint_save_dir and prints the training logs.

To train the model, run one of the following commands, depending on the target task:

# Training example for speaking target identification
python train.py \
--task 'STI' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_save_dir './checkpoints' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16 --learning_rate 0.000005 \
--epochs 200 --epochs_warmup 10

# Training example for pronoun coreference resolution
python train.py \
--task 'PCR' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_save_dir './checkpoints' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16 --learning_rate 0.000005 \
--epochs 200 --epochs_warmup 10

# Training example for mentioned player prediction
python train.py \
--task 'MPP' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_save_dir './checkpoints' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16 --learning_rate 0.000005 \
--epochs 200 --epochs_warmup 10

Descriptions of training parameters are as follows:

  • --task: target task (STI or PCR or MPP)
  • --txt_dir: directory of anonymized transcripts
  • --txt_labeled_dir: directory of labeled anonymized transcripts
  • --keypoint_dir: directory of keypoints
  • --meta_dir: directory of game meta data
  • --data_split_file: file path for data split
  • --checkpoint_save_dir: directory for saving checkpoints
  • --language_model: language model (bert or roberta or electra)
  • --max_people_num: maximum number of players
  • --context_length: size of conversation context
  • --batch_size: mini-batch size
  • --learning_rate: learning rate
  • --epochs: number of total epochs
  • --epochs_warmup: number of visual warmup epochs
  • Refer to train.py for more details

Testing

test.py evaluates the performance of a trained model.

To test the model, run the following command:

# Testing example
python test.py \
--task 'STI' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_file 'enter_the_path' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16

Descriptions of testing parameters are as follows:

  • --task: target task (STI or PCR or MPP)
  • --txt_dir: directory of anonymized transcripts
  • --txt_labeled_dir: directory of labeled anonymized transcripts
  • --keypoint_dir: directory of keypoints
  • --meta_dir: directory of game meta data
  • --data_split_file: file path for data split
  • --checkpoint_file: file path for loading checkpoint
  • --language_model: language model (bert or roberta or electra)
  • --max_people_num: maximum number of players
  • --context_length: size of conversation context
  • --batch_size: mini-batch size
  • Refer to test.py for more details

Pretrained Models

You can download the pretrained models listed in the table below.

Dataset   Target Task                       Pretrained Models
YouTube   Speaking Target Identification    Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA
YouTube   Pronoun Coreference Resolution    Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA
YouTube   Mentioned Player Prediction       Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA
Ego4D     Speaking Target Identification    Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA
Ego4D     Pronoun Coreference Resolution    Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA
Ego4D     Mentioned Player Prediction       Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA
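
After downloading a checkpoint, pass its path to test.py via --checkpoint_file; the --language_model flag should presumably match the backbone of the chosen checkpoint. The checkpoint file name below is only a placeholder, not an actual file name from the release:

# Example: evaluate a downloaded YouTube STI checkpoint (file name is a placeholder)
python test.py \
--task 'STI' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_file './checkpoints/youtube_sti_baseline_bert.pth' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16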

Citation

If you find this work useful in your research, please cite our paper:

@inproceedings{lee2024modeling,
  title={Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations},
  author={Lee, Sangmin and Lai, Bolin and Ryan, Fiona and Boote, Bikram and Rehg, James M},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}

If you use or reference the base datasets, please also cite the following paper:

@inproceedings{lai2023werewolf,
  title={Werewolf Among Us: Multimodal Resources for Modeling Persuasion Behaviors in Social Deduction Games},
  author={Lai, Bolin and Zhang, Hongxin and Liu, Miao and Pariani, Aryan and Ryan, Fiona and Jia, Wenqi
          and Hayati, Shirley Anugrah and Rehg, James and Yang, Diyi},
  booktitle={Findings of the Association for Computational Linguistics},
  pages={6570--6588},
  year={2023}
}