- Authors: Shoubin Yu*, Jaehong Yoon*, Mohit Bansal
- Paper: arXiv
- Project Page: homepage
- Online Demo: Coming soon
```bash
# CREMA code
./lavis/

# running scripts for CREMA training/inference
./run_scripts
```
- (Optional) Create a conda environment:

```bash
conda create -n crema python=3.8
conda activate crema
```

- Build from source:

```bash
pip install -e .
```
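A quick sanity check that the editable install succeeded (a minimal sketch; it only assumes the steps above completed without errors):

```bash
# Confirm lavis is importable from the current environment and see where it resolves.
python -c "import lavis; print(lavis.__file__)"
```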
- Visual Encoder: we adopt the pre-trained ViT-G (1B); the codebase downloads the model automatically.
- Audio Encoder: we use the pre-trained BEATs (iter3+); please download the model here and update its path in the code (see the sketch after this list).
- 3D Encoder: we perform off-line feature extraction following 3D-LLM; please refer to this page for pre-extracted features, and update the storage path in the dataset config.
- Multimodal Q-Former: we initialize the query tokens and the FC layer of each MMQA in the Multimodal Q-Former from pre-trained BLIP-2 model checkpoints. We host the Multimodal Q-Former with pre-trained MMQA-audio and MMQA-3D on Hugging Face, and the Multimodal Q-Former initialized from BLIP-2 can be found here.
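A minimal sketch of one way to fetch and organize these checkpoints and then locate the config fields to update. The repo id, file name, and directory layout below are placeholders/assumptions rather than the exact CREMA artifacts; use the links above.

```bash
# Optional: fetch a Hugging Face-hosted checkpoint (<ORG>/<REPO> is a placeholder; use the link above).
pip install -U "huggingface_hub[cli]"
huggingface-cli download <ORG>/<REPO> --local-dir checkpoints/mmqa

# Keep the BEATs iter3+ weights somewhere stable (the filename is an assumption).
mkdir -p checkpoints/beats
mv BEATs_iter3_plus_AS2M.pt checkpoints/beats/

# Locate the config entries to edit: the BEATs checkpoint path and the dataset storage paths.
grep -rn --include="*.yaml" -i "beats" lavis/
grep -rn --include="*.yaml" "storage" lavis/configs/
```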
Dataset | Modalities |
---|---|
SQA3D | Video+3D+Depth |
SQA3D | Video+3D+Depth (espresso) |
MUSIC-AVQA | Video+Audio+Flow |
MUSIC-AVQA | Video+Audio+Flow (espresso) |
NExT-QA | Video+Flow+Depth+Normal |
NExT-QA | Video+Flow+Depth+Normal (espresso) |
We test our model on:
- MUSIC-AVQA: we follow the original MUSIC-AVQA data format.
We extract the various extra modalities from raw videos with pre-trained models; please refer to each model's repo and the paper appendix for more details (a frame-extraction sketch follows the table below).
We will share extracted features in the following table soon.
Dataset | Multimodal Features |
---|---|
SQA3D | Video Frames, Depth Map |
MUSIC-AVQA | Video Frames, Optical Flow |
NExT-QA | Video Frames, Depth Map, Optical Flow, Surface Normals |
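As a minimal illustration of the frame-extraction step mentioned above (the sampling rate, resolution, and output layout here are assumptions, not necessarily what CREMA uses; see the paper appendix), frames can be dumped from a raw video with ffmpeg:

```bash
# Dump frames at 1 fps into a per-video folder (paths and fps are illustrative placeholders).
mkdir -p features/frames/video_0001
ffmpeg -i videos/video_0001.mp4 -vf fps=1 features/frames/video_0001/%05d.jpg
```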
We pre-train the MMQAs in the CREMA framework with public modality-specific datasets.
We provide CREMA training and inference script examples as follows.
```bash
# training (SQA3D)
sh run_scripts/crema/finetune/sqa3d.sh

# inference (SQA3D)
sh run_scripts/crema/inference/sqa3d.sh
```
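For example, to pin a run to a single GPU, the standard CUDA_VISIBLE_DEVICES variable can be set in front of the script (a sketch; scripts for the other benchmarks should follow the same pattern, but check run_scripts/crema/ for the exact filenames):

```bash
# Run SQA3D inference on GPU 0 only.
CUDA_VISIBLE_DEVICES=0 sh run_scripts/crema/inference/sqa3d.sh

# List the available training/inference scripts for the other benchmarks.
ls run_scripts/crema/finetune/ run_scripts/crema/inference/
```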
We thank the developers of LAVIS, BLIP-2, CLIP, and X-InstructBLIP for their public code releases.
Please cite our paper if you use our models in your work:

```bibtex
@article{yu2024crema,
  title={CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion},
  author={Yu, Shoubin and Yoon, Jaehong and Bansal, Mohit},
  journal={arXiv preprint arXiv:2402.05889},
  year={2024}
}
```