Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision

¹ETH Zurich, ²EPFL

• ECCV 2024 •

Our proposed MOOSA framework for Multimodal Open-Set Domain Generalization and Adaptation.

Code

The code was tested with Python 3.10.13 and torch 2.3.1+cu121 on an NVIDIA GeForce RTX 3090. Additional dependencies are listed in requirement.txt.

Environments:

mmcv-full 1.2.7
mmaction2 0.13.0
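
One way to compare a local environment against the versions listed above is to query the installed package metadata. The following is a minimal sketch and not part of this repository: the helper name `installed_versions` is hypothetical, and it assumes the packages are installed under the distribution names `torch`, `mmcv-full`, and `mmaction2`.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Versions this README reports as tested.
    tested = {"torch": "2.3.1+cu121", "mmcv-full": "1.2.7", "mmaction2": "0.13.0"}
    found = installed_versions(tested)
    for name, want in tested.items():
        print(f"{name}: tested with {want}, found {found[name]}")
```

A `None` entry means the package is not installed in the active environment.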

EPIC-Kitchens Dataset

Prepare

Download Pretrained Weights

Download the Audio model (link), rename it to vggsound_avgpool.pth.tar, and place it under the EPIC-rgb-flow-audio/pretrained_models directory.
Download the SlowFast model for the RGB modality (link) and place it under the EPIC-rgb-flow-audio/pretrained_models directory.
Download the SlowOnly model for the Flow modality (link) and place it under the EPIC-rgb-flow-audio/pretrained_models directory.
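
Before training, it may help to confirm that the checkpoints are where the scripts expect them. The sketch below is an illustration, not a repository utility: `missing_weights` is a hypothetical helper, and only the audio checkpoint name is taken from this README. The SlowFast and SlowOnly file names are not stated here, so they are left for the reader to append.

```python
from pathlib import Path


def missing_weights(model_dir, filenames):
    """Return the names in `filenames` that are not regular files in `model_dir`."""
    root = Path(model_dir)
    return [name for name in filenames if not (root / name).is_file()]


if __name__ == "__main__":
    # Only the audio checkpoint name is given in this README; append the
    # SlowFast and SlowOnly checkpoint names as downloaded.
    expected = ["vggsound_avgpool.pth.tar"]
    missing = missing_weights("EPIC-rgb-flow-audio/pretrained_models", expected)
    print("missing:", missing if missing else "none")
```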

Download EPIC-Kitchens Dataset

bash download_script.sh

Download the audio files (EPIC-KITCHENS-audio.zip).

Unzip all files and arrange the directory structure as follows:

├── MM-SADA_Domain_Adaptation_Splits
├── rgb
|   ├── train
|   |   ├── D1
|   |   |   ├── P08_01.wav
|   |   |   ├── P08_01
|   |   |   |     ├── frame_0000000000.jpg
|   |   |   |     ├── ...
|   |   |   ├── P08_02.wav
|   |   |   ├── P08_02
|   |   |   ├── ...
|   |   ├── D2
|   |   ├── D3
|   ├── test
|   |   ├── D1
|   |   ├── D2
|   |   ├── D3


├── flow
|   ├── train
|   |   ├── D1
|   |   |   ├── P08_01 
|   |   |   |   ├── u
|   |   |   |   |   ├── frame_0000000000.jpg
|   |   |   |   |   ├── ...
|   |   |   |   ├── v
|   |   |   ├── P08_02
|   |   |   ├── ...
|   |   ├── D2
|   |   ├── D3
|   ├── test
|   |   ├── D1
|   |   ├── D2
|   |   ├── D3
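
The layout above can be checked programmatically before launching a run. The following is a rough sketch under stated assumptions: `check_epic_layout` is a hypothetical helper, and it assumes the test split mirrors the train split shown in the tree (each RGB frame folder sits next to a `.wav` file of the same name, and each flow folder contains `u` and `v`).

```python
from pathlib import Path

DOMAINS = ("D1", "D2", "D3")
SPLITS = ("train", "test")


def check_epic_layout(datapath):
    """Return a list of human-readable problems found under `datapath`."""
    root = Path(datapath)
    problems = []
    splits_dir = root / "MM-SADA_Domain_Adaptation_Splits"
    if not splits_dir.is_dir():
        problems.append(f"missing directory: {splits_dir}")
    for split in SPLITS:
        for domain in DOMAINS:
            rgb_dir = root / "rgb" / split / domain
            flow_dir = root / "flow" / split / domain
            for folder in (rgb_dir, flow_dir):
                if not folder.is_dir():
                    problems.append(f"missing directory: {folder}")
            if rgb_dir.is_dir():
                # Each frame folder (e.g. P08_01) should sit next to P08_01.wav.
                for video in sorted(p for p in rgb_dir.iterdir() if p.is_dir()):
                    if not (rgb_dir / f"{video.name}.wav").is_file():
                        problems.append(f"missing audio: {video}.wav")
            if flow_dir.is_dir():
                # Each flow folder should contain the u and v components.
                for video in sorted(p for p in flow_dir.iterdir() if p.is_dir()):
                    for comp in ("u", "v"):
                        if not (video / comp).is_dir():
                            problems.append(f"missing flow component: {video / comp}")
    return problems
```

Calling `check_epic_layout("/path/to/EPIC-KITCHENS/")` returns an empty list when every expected folder and audio file is present.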

Video and Audio

cd EPIC-rgb-flow-audio

python train_video_flow_audio_EPIC_MOOSA.py --use_video --use_audio -s D2 D3 -t D1 --lr 1e-4 --bsz 16 --nepochs 5 --mask_ratio 0.7 --entropy_min_weight 0.001 --datapath /path/to/EPIC-KITCHENS/

python train_video_flow_audio_EPIC_MOOSA.py --use_video --use_audio -s D1 D3 -t D2 --lr 1e-4 --bsz 16 --nepochs 10 --mask_ratio 0.7 --entropy_min_weight 1.0 --datapath /path/to/EPIC-KITCHENS/

python train_video_flow_audio_EPIC_MOOSA.py --use_video --use_audio -s D1 D2 -t D3 --lr 1e-4 --bsz 16 --nepochs 20 --mask_ratio 0.7 --entropy_min_weight 0.001 --datapath /path/to/EPIC-KITCHENS/

Video and Flow

cd EPIC-rgb-flow-audio

python train_video_flow_audio_EPIC_MOOSA.py --use_video --use_flow -s D2 D3 -t D1 --lr 1e-4 --bsz 16 --nepochs 15 --mask_ratio 0.7 --entropy_min_weight 0.001 --datapath /path/to/EPIC-KITCHENS/

python train_video_flow_audio_EPIC_MOOSA.py --use_video --use_flow -s D1 D3 -t D2 --lr 1e-4 --bsz 16 --nepochs 20 --mask_ratio 0.7 --entropy_min_weight 1.0 --datapath /path/to/EPIC-KITCHENS/

python train_video_flow_audio_EPIC_MOOSA.py --use_video --use_flow -s D1 D2 -t D3 --lr 1e-4 --bsz 16 --nepochs 15 --mask_ratio 0.7 --entropy_min_weight 0.001 --datapath /path/to/EPIC-KITCHENS/

Flow and Audio

cd EPIC-rgb-flow-audio

python train_video_flow_audio_EPIC_MOOSA.py --use_flow --use_audio -s D2 D3 -t D1 --lr 1e-4 --bsz 16 --nepochs 25 --mask_ratio 0.3 --entropy_min_weight 0.001 --datapath /path/to/EPIC-KITCHENS/

python train_video_flow_audio_EPIC_MOOSA.py --use_flow --use_audio -s D1 D3 -t D2 --lr 1e-4 --bsz 16 --nepochs 10 --mask_ratio 0.3 --entropy_min_weight 0.001 --datapath /path/to/EPIC-KITCHENS/

python train_video_flow_audio_EPIC_MOOSA.py --use_flow --use_audio -s D1 D2 -t D3 --lr 1e-4 --bsz 16 --nepochs 10 --mask_ratio 0.3 --entropy_min_weight 0.001 --datapath /path/to/EPIC-KITCHENS/

Video and Flow and Audio

cd EPIC-rgb-flow-audio

python train_video_flow_audio_EPIC_MOOSA.py --use_video --use_flow --use_audio -s D2 D3 -t D1 --lr 1e-4 --bsz 16 --nepochs 15 --mask_ratio 0.7 --entropy_min_weight 0.001 --jigsaw_num_splits 2 --datapath /path/to/EPIC-KITCHENS/

python train_video_flow_audio_EPIC_MOOSA.py --use_video --use_flow --use_audio -s D1 D3 -t D2 --lr 1e-4 --bsz 16 --nepochs 20 --mask_ratio 0.7 --entropy_min_weight 0.1 --jigsaw_num_splits 2 --datapath /path/to/EPIC-KITCHENS/

python train_video_flow_audio_EPIC_MOOSA.py --use_video --use_flow --use_audio -s D1 D2 -t D3 --lr 1e-4 --bsz 16 --nepochs 20 --mask_ratio 0.3 --entropy_min_weight 0.1 --jigsaw_num_splits 2 --datapath /path/to/EPIC-KITCHENS/
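
The EPIC-Kitchens commands above all follow a leave-one-domain-out pattern: each of D1, D2, and D3 is held out once as the target while the other two serve as sources. The sketch below shows one way to generate such commands. The helpers `leave_one_out_splits` and `build_command` are hypothetical and not part of the repository. Because `--nepochs` and `--entropy_min_weight` differ per split in this README, the usage example omits them; copy the per-split values from the commands above.

```python
import shlex

DOMAINS = ["D1", "D2", "D3"]


def leave_one_out_splits(domains):
    """Yield (sources, target) with each domain held out once as the target."""
    for target in domains:
        yield [d for d in domains if d != target], target


def build_command(script, modalities, sources, target, datapath, **hparams):
    """Assemble one training command as an argv list."""
    argv = ["python", script]
    argv += [f"--use_{m}" for m in modalities]
    argv += ["-s", *sources, "-t", target]
    for flag, value in hparams.items():
        argv += [f"--{flag}", str(value)]
    argv += ["--datapath", datapath]
    return argv


if __name__ == "__main__":
    for sources, target in leave_one_out_splits(DOMAINS):
        cmd = build_command(
            "train_video_flow_audio_EPIC_MOOSA.py",
            ["video", "audio"],
            sources,
            target,
            "/path/to/EPIC-KITCHENS/",
            lr="1e-4", bsz=16, mask_ratio=0.7,
        )
        print(shlex.join(cmd))
```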

HAC Dataset

This dataset can be downloaded at link.

Unzip all files and arrange the directory structure as follows:

HAC
├── human
|   ├── videos
|   |   ├── ...
|   ├── flow
|   |   ├── ...
|   ├── audio
|   |   ├── ...

├── animal
|   ├── videos
|   |   ├── ...
|   ├── flow
|   |   ├── ...
|   ├── audio
|   |   ├── ...

├── cartoon
|   ├── videos
|   |   ├── ...
|   ├── flow
|   |   ├── ...
|   ├── audio
|   |   ├── ...
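
A quick inventory of the HAC folders can reveal an incomplete download or a misplaced modality. The sketch below is illustrative only: `hac_entry_counts` is a hypothetical helper that counts the entries directly inside each domain and modality folder named in the tree above, and it makes no assumption about file names or extensions.

```python
from pathlib import Path

HAC_DOMAINS = ("human", "animal", "cartoon")
HAC_MODALITIES = ("videos", "flow", "audio")


def hac_entry_counts(datapath):
    """Return {(domain, modality): entry count}, with None for a missing folder."""
    root = Path(datapath)
    counts = {}
    for domain in HAC_DOMAINS:
        for modality in HAC_MODALITIES:
            folder = root / domain / modality
            counts[(domain, modality)] = (
                sum(1 for _ in folder.iterdir()) if folder.is_dir() else None
            )
    return counts


if __name__ == "__main__":
    for (domain, modality), n in hac_entry_counts("/path/to/HAC/").items():
        print(f"{domain}/{modality}: {'MISSING' if n is None else n}")
```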

Download the pretrained weights as described for the EPIC-Kitchens dataset and place them under the HAC-rgb-flow-audio/pretrained_models directory.

Video and Audio

cd HAC-rgb-flow-audio

python train_video_flow_audio_HAC_MOOSA.py --use_video --use_audio -s 'animal' 'cartoon' -t 'human' --lr 1e-4 --bsz 16 --nepochs 5 --mask_ratio 0.3 --entropy_min_weight 0.001 --datapath /path/to/HAC/

python train_video_flow_audio_HAC_MOOSA.py --use_video --use_audio -s 'human' 'cartoon' -t 'animal' --lr 1e-4 --bsz 16 --nepochs 10 --mask_ratio 0.7 --entropy_min_weight 0.001 --datapath /path/to/HAC/

python train_video_flow_audio_HAC_MOOSA.py --use_video --use_audio -s 'human' 'animal' -t 'cartoon' --lr 1e-4 --bsz 16 --nepochs 10 --mask_ratio 0.3 --entropy_min_weight 0.001 --datapath /path/to/HAC/

Video and Flow

cd HAC-rgb-flow-audio

python train_video_flow_audio_HAC_MOOSA.py --use_video --use_flow -s 'animal' 'cartoon' -t 'human' --lr 1e-4 --bsz 16 --nepochs 20 --mask_ratio 0.3 --entropy_min_weight 0.001 --datapath /path/to/HAC/

python train_video_flow_audio_HAC_MOOSA.py --use_video --use_flow -s 'human' 'cartoon' -t 'animal' --lr 1e-4 --bsz 16 --nepochs 20 --mask_ratio 0.3 --entropy_min_weight 0.001 --datapath /path/to/HAC/

python train_video_flow_audio_HAC_MOOSA.py --use_video --use_flow -s 'human' 'animal' -t 'cartoon' --lr 1e-4 --bsz 16 --nepochs 20 --mask_ratio 0.3 --entropy_min_weight 0.001 --datapath /path/to/HAC/

Flow and Audio

cd HAC-rgb-flow-audio

python train_video_flow_audio_HAC_MOOSA.py --use_flow --use_audio -s 'animal' 'cartoon' -t 'human' --lr 1e-4 --bsz 16 --nepochs 10 --mask_ratio 0.3 --entropy_min_weight 0.001 --datapath /path/to/HAC/

python train_video_flow_audio_HAC_MOOSA.py --use_flow --use_audio -s 'human' 'cartoon' -t 'animal' --lr 1e-4 --bsz 16 --nepochs 10 --mask_ratio 0.3 --entropy_min_weight 0.001 --datapath /path/to/HAC/

python train_video_flow_audio_HAC_MOOSA.py --use_flow --use_audio -s 'human' 'animal' -t 'cartoon' --lr 1e-4 --bsz 16 --nepochs 10 --mask_ratio 0.7 --entropy_min_weight 0.001 --datapath /path/to/HAC/

Video and Flow and Audio

cd HAC-rgb-flow-audio

python train_video_flow_audio_HAC_MOOSA.py --use_video --use_flow --use_audio -s 'animal' 'cartoon' -t 'human' --lr 1e-4 --bsz 16 --nepochs 10 --mask_ratio 0.7 --entropy_min_weight 0.001 --jigsaw_num_splits 2 --datapath /path/to/HAC/

python train_video_flow_audio_HAC_MOOSA.py --use_video --use_flow --use_audio -s 'human' 'cartoon' -t 'animal' --lr 1e-4 --bsz 16 --nepochs 10 --mask_ratio 0.7 --entropy_min_weight 0.001 --jigsaw_num_splits 2 --jigsaw_samples 64 --datapath /path/to/HAC/

python train_video_flow_audio_HAC_MOOSA.py --use_video --use_flow --use_audio -s 'human' 'animal' -t 'cartoon' --lr 1e-4 --bsz 16 --nepochs 20 --mask_ratio 0.7 --alpha_trans 0.5 --entropy_min_weight 0.001 --jigsaw_num_splits 2 --datapath /path/to/HAC/

Multimodal Open-Set Domain Adaptation

Video and Audio

cd EPIC-rgb-flow-audio

python train_video_audio_EPIC_MOOSA_OSDA.py -s D1 D2 -t D2 --lr 1e-4 --bsz 16 --nepochs 10 --mask_ratio 0.3 --target_filter_thr 0.5 --datapath /path/to/EPIC-KITCHENS/

python train_video_audio_EPIC_MOOSA_OSDA.py -s D1 D3 -t D3 --lr 1e-4 --bsz 16 --nepochs 10 --mask_ratio 0.7 --target_filter_thr 0.5 --datapath /path/to/EPIC-KITCHENS/

python train_video_audio_EPIC_MOOSA_OSDA.py -s D2 D1 -t D1 --lr 1e-4 --bsz 16 --nepochs 15 --mask_ratio 0.7 --target_filter_thr 0.5 --datapath /path/to/EPIC-KITCHENS/

python train_video_audio_EPIC_MOOSA_OSDA.py -s D2 D3 -t D3 --lr 1e-4 --bsz 16 --nepochs 15 --mask_ratio 0.7 --target_filter_thr 0.5 --datapath /path/to/EPIC-KITCHENS/

python train_video_audio_EPIC_MOOSA_OSDA.py -s D3 D1 -t D1 --lr 1e-4 --bsz 16 --nepochs 20 --mask_ratio 0.7 --target_filter_thr 0.3 --datapath /path/to/EPIC-KITCHENS/

python train_video_audio_EPIC_MOOSA_OSDA.py -s D3 D2 -t D2 --lr 1e-4 --bsz 16 --nepochs 15 --mask_ratio 0.7 --target_filter_thr 0.5 --datapath /path/to/EPIC-KITCHENS/
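
The six commands above cover every ordered pair of distinct domains. The sketch below enumerates those pairs and reproduces the `-s`/`-t` argument pattern exactly as it is written in this README. The helper `osda_domain_args` is hypothetical, and the sketch makes no claim about how the training script interprets the two values passed to `-s`.

```python
from itertools import permutations


def osda_domain_args(domains):
    """Yield the -s/-t arguments for every ordered (first, target) domain pair."""
    for first, target in permutations(domains, 2):
        # Mirrors the commands above: -s receives the first domain followed
        # by the target domain, and -t receives the target domain.
        yield ["-s", first, target, "-t", target]


if __name__ == "__main__":
    for args in osda_domain_args(["D1", "D2", "D3"]):
        print(" ".join(args))
```

The per-pair values of `--nepochs`, `--mask_ratio`, and `--target_filter_thr` vary, so they should be taken from the commands above.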

Multimodal Open-Partial Domain Generalization

Video and Audio

cd EPIC-rgb-flow-audio

python train_video_audio_EPIC_MOOSA_Open_Partial.py -s D2 D3 -t D1 --lr 1e-4 --bsz 16 --nepochs 15 --mask_ratio 0.7 --entropy_min_weight 1.0 --datapath /path/to/EPIC-KITCHENS/

python train_video_audio_EPIC_MOOSA_Open_Partial.py -s D1 D3 -t D2 --lr 1e-4 --bsz 16 --nepochs 15 --mask_ratio 0.7 --entropy_min_weight 1.0 --datapath /path/to/EPIC-KITCHENS/

python train_video_audio_EPIC_MOOSA_Open_Partial.py -s D1 D2 -t D3 --lr 1e-4 --bsz 16 --nepochs 15 --mask_ratio 0.3 --entropy_min_weight 1.0 --datapath /path/to/EPIC-KITCHENS/

Contact

If you have any questions, please send an email to donghaospurs@gmail.com.

Citation

If you find our work useful in your research, please consider citing our paper:

@inproceedings{dong2024moosa,
    title={Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision},
    author={Dong, Hao and Chatzi, Eleni and Fink, Olga},
    booktitle={European Conference on Computer Vision},
    year={2024}
}

Related Projects

SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization

MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities

Acknowledgement

Many thanks to the excellent open-source projects SimMMDG and DomainAdaptation.
