Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations
This repository contains the official PyTorch implementation of the following paper:
Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations (CVPR 2024 Oral)
Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, James M. Rehg
Paper: https://arxiv.org/abs/2403.02090

Abstract: Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions.
- python 3
- pytorch 2.0+
- transformers
- numpy
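A quick way to verify the environment is to import the dependencies listed above (this snippet is a convenience sketch, not part of the repository):

```python
# Verify that the dependencies listed above are importable
# (convenience sketch; not part of this repository).
import numpy
import torch
import transformers

print('torch:', torch.__version__)
print('transformers:', transformers.__version__)
print('numpy:', numpy.__version__)
print('CUDA available:', torch.cuda.is_available())
```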
- Download the benchmark datasets (YouTube, Ego4D) from [link].
- For access to the original base datasets including videos, visit [link].
- You can download the aligned player keypoint samples from [link].
train.py saves the weights in --checkpoint_save_dir and shows the training logs.
To train the model, run the following command:
# Training example for speaking target identification
python train.py \
--task 'STI' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_save_dir './checkpoints' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16 --learning_rate 0.000005 \
--epochs 200 --epochs_warmup 10
# Training example for pronoun coreference resolution
python train.py \
--task 'PCR' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_save_dir './checkpoints' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16 --learning_rate 0.000005 \
--epochs 200 --epochs_warmup 10
# Training example for mentioned player prediction
python train.py \
--task 'MPP' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_save_dir './checkpoints' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16 --learning_rate 0.000005 \
--epochs 200 --epochs_warmup 10
Descriptions of the training parameters are as follows (see the argparse sketch after this list):
- --task: target task (STI, PCR, or MPP)
- --txt_dir: directory of anonymized transcripts
- --txt_labeled_dir: directory of labeled anonymized transcripts
- --keypoint_dir: directory of keypoints
- --meta_dir: directory of game meta data
- --data_split_file: file path for data split
- --checkpoint_save_dir: directory for saving checkpoints
- --language_model: language model (bert, roberta, or electra)
- --max_people_num: maximum number of players
- --context_length: size of conversation context
- --batch_size: mini-batch size
- --learning_rate: learning rate
- --epochs: number of total epochs
- --epochs_warmup: number of visual warmup epochs
- Refer to train.py for more details.
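The flags above map directly onto a command-line parser. The following is a minimal argparse sketch consistent with the documented flags; the actual parsing logic in train.py may differ, and the defaults here simply mirror the example commands above:

```python
import argparse

# Minimal sketch of the training CLI implied by the documented flags.
# The actual parser in train.py may differ; defaults mirror the examples above.
parser = argparse.ArgumentParser(description='Train the multimodal baseline')
parser.add_argument('--task', choices=['STI', 'PCR', 'MPP'], required=True,
                    help='target task')
parser.add_argument('--txt_dir', type=str, help='directory of anonymized transcripts')
parser.add_argument('--txt_labeled_dir', type=str,
                    help='directory of labeled anonymized transcripts')
parser.add_argument('--keypoint_dir', type=str, help='directory of keypoints')
parser.add_argument('--meta_dir', type=str, help='directory of game meta data')
parser.add_argument('--data_split_file', type=str, help='file path for data split')
parser.add_argument('--checkpoint_save_dir', type=str, default='./checkpoints',
                    help='directory for saving checkpoints')
parser.add_argument('--language_model', choices=['bert', 'roberta', 'electra'],
                    default='bert', help='language model backbone')
parser.add_argument('--max_people_num', type=int, default=6,
                    help='maximum number of players')
parser.add_argument('--context_length', type=int, default=5,
                    help='size of conversation context')
parser.add_argument('--batch_size', type=int, default=16, help='mini-batch size')
parser.add_argument('--learning_rate', type=float, default=5e-6, help='learning rate')
parser.add_argument('--epochs', type=int, default=200, help='number of total epochs')
parser.add_argument('--epochs_warmup', type=int, default=10,
                    help='number of visual warmup epochs')
args = parser.parse_args()
```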
test.py evaluates the performance of a trained model.
To test the model, run the following command:
# Testing example
python test.py \
--task 'STI' \
--txt_dir 'enter_the_path' --txt_labeled_dir 'enter_the_path' \
--keypoint_dir 'enter_the_path' --meta_dir 'enter_the_path' \
--data_split_file 'enter_the_path' --checkpoint_file 'enter_the_path' \
--language_model 'bert' --max_people_num 6 --context_length 5 \
--batch_size 16
Descriptions of the testing parameters are as follows (see the checkpoint-loading sketch after this list):
- --task: target task (STI, PCR, or MPP)
- --txt_dir: directory of anonymized transcripts
- --txt_labeled_dir: directory of labeled anonymized transcripts
- --keypoint_dir: directory of keypoints
- --meta_dir: directory of game meta data
- --data_split_file: file path for data split
- --checkpoint_file: file path for loading checkpoint
- --language_model: language model (bert, roberta, or electra)
- --max_people_num: maximum number of players
- --context_length: size of conversation context
- --batch_size: mini-batch size
- Refer to test.py for more details.
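--checkpoint_file is expected to point at weights produced by a training run. Below is a minimal sketch of the save/load cycle implied by --checkpoint_save_dir and --checkpoint_file, using a stand-in model; the real baseline architecture and its checkpoint format are assumptions here:

```python
import torch
import torch.nn as nn

# Stand-in model; the real multimodal baseline in this repo is more complex,
# and its checkpoint format is an assumption.
model = nn.Linear(8, 3)

# What train.py is expected to do under --checkpoint_save_dir ...
torch.save(model.state_dict(), 'example_checkpoint.pth')

# ... and what test.py is expected to do with --checkpoint_file.
model.load_state_dict(torch.load('example_checkpoint.pth', map_location='cpu'))
model.eval()  # switch to evaluation mode before measuring performance
```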
You can download the pretrained models from the table below.
Dataset | Target Task | Pretrained Models |
---|---|---|
YouTube | Speaking Target Identification | Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA |
YouTube | Pronoun Coreference Resolution | Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA |
YouTube | Mentioned Player Prediction | Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA |
Ego4D | Speaking Target Identification | Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA |
Ego4D | Pronoun Coreference Resolution | Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA |
Ego4D | Mentioned Player Prediction | Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA |
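The --language_model flag and the Baseline-BERT / Baseline-RoBERTa / Baseline-ELECTRA naming suggest Hugging Face transformer backbones. The sketch below shows how such a flag might map to transformers checkpoints; the exact model identifiers used by this repo are assumptions:

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical mapping from the --language_model flag to Hugging Face
# checkpoints; the identifiers used by the actual baselines are assumptions.
LM_CHECKPOINTS = {
    'bert': 'bert-base-uncased',
    'roberta': 'roberta-base',
    'electra': 'google/electra-base-discriminator',
}

name = LM_CHECKPOINTS['bert']
tokenizer = AutoTokenizer.from_pretrained(name)
language_model = AutoModel.from_pretrained(name)
```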
If you find this work useful in your research, please cite our paper:
@inproceedings{lee2024modeling,
title={Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations},
author={Lee, Sangmin and Lai, Bolin and Ryan, Fiona and Boote, Bikram and Rehg, James M},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}
If you use or reference the base datasets, please also cite the following paper:
@inproceedings{lai2023werewolf,
title={Werewolf Among Us: Multimodal Resources for Modeling Persuasion Behaviors in Social Deduction Games},
author={Lai, Bolin and Zhang, Hongxin and Liu, Miao and Pariani, Aryan and Ryan, Fiona and Jia, Wenqi
and Hayati, Shirley Anugrah and Rehg, James and Yang, Diyi},
booktitle={Findings of the Association for Computational Linguistics},
pages={6570--6588},
year={2023}
}