/SBNet

Official implementation of SBNet as described in "Single-branch Network for Multimodal Training".

Primary LanguagePython

SBNet (ICASSP 2023)

Official implementation of SBNet as described in "Single-branch Network for Multimodal Training".

Paper Link: SBNet

Presentation: https://youtu.be/bXeiy8kQQtY

Proposed Methodology

a) Two independent modality-specific embedding networks to extract features (left) and a conventional two-branch network (right) having two independent modality-specific branches to learn discriminative joint representations of the multimodal task. (b) Proposed network with a single modality-invariant branch.

Installation

We have used the following setup for our experiments:

python==3.6.5

CUDA and cuDNN Setup:

For tensorflow:

  • CUDA Toolkit 10.1
  • cudnn v7.6.5.32 for CUDA10.1

For PyTorch:

  • CUDA Toolkit 10.2
  • cudnn v8.2.1.32 for CUDA10.2

To install PyTorch and TensorFlow with GPU support:

  pip install tensorflow-gpu==1.13.1
  pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

Feature Extraction

We perform experiments on cross-modal verification and cross-modal matching tasks on the large-scale VoxCeleb1 dataset.

Facial Feature Extraction

For face feature extraction we use Facenet. The official implmentation from authors is available hereGitHub stars

Voice Feature Extraction

For Voice Embeddings we use the method described in Utterance Level Aggregator. The code we used is released by authors and is publicly available hereGitHub stars

Extracted Features

The face and voice features used in our work can be accessed here. Once downloaded, place the files like this:

|-- data
  |-- voice
    |-- .csv files
  |-- face
    |--  .csv files
|-- imgs
|-- ssnet_cent_git
|-- ssnet_fop
|-- twobranch_cent_git
|-- twobranch_fop

Training and Testing

FOP Loss

# Training
python main.py --save_dir ./model --batch_size 128 --max_num_epoch 100 --dim_embed 128 --split_type <face_only, voice_only, hefhev, hevhef, random, fvfv, vfvf>

# Testing
python test.py --split_type vfvf --sh unseenunheard --test random

Cent/Git Loss

# Training
python main.py --save_dir ./model --batch_size 128 --max_num_epoch 100 --split_type <face_only, voice_only, hefhev, hevhef, random, fvfv, vfvf> --loss <git, cent>

# Testing
python test.py --split_type fvfv --sh unseenunheard --test random

Baseline

For baseline results, we leverage the work from FOP.

Citation

@inproceedings{saeed2023sbnet,
  title={Single-branch Network for Multimodal Training},
  author={Saeed, Muhammad Saad and Nawaz, Shah and Yousaf and Khan, Muhammad Haris and Zaheer, Muhammad Zaigham and Nandakumar, Karthik and Yousaf, Muhammad Haroon and Mahmood, Arf},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2023},
  organization={IEEE}
}

@inproceedings{saeed2022fusion,
  title={Fusion and Orthogonal Projection for Improved Face-Voice Association},
  author={Saeed, Muhammad Saad and Khan, Muhammad Haris and Nawaz, Shah and Yousaf, Muhammad Haroon and Del Bue, Alessio},
  booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7057--7061},
  year={2022},
  organization={IEEE}
}