Official implementation of RAVEn (ICLR 2023) and BRAVEn (ICASSP 2024)

Primary language: Python. License: MIT.

(B)RAVEn: A PyTorch Lightning Implementation

Introduction

We provide code for reproducing the main results of "Jointly Learning Visual and Auditory Speech Representations from Raw Data" (RAVEn) and "BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition" (BRAVEn). Our implementation is based on PyTorch Lightning.

Preparation

Installation

Run conda env create -f environment.yml. If necessary, change the environment prefix to match the location of your miniconda3 installation.

Data

  1. The datasets used in the paper can be downloaded from the following links:
  2. Compute 68 facial landmarks per frame, e.g., using RetinaFace and 2-D FAN, or download them, e.g., from this repo. Each landmark file should have the same name as its corresponding video, except that it ends in .npy.
  3. Use the following command to crop the mouths:
    python preprocessing/extract_mouths.py --src_dir ${SOURCE_DIR} --tgt_dir ${TARGET_DIR} --landmarks_dir ${LANDMARKS_DIR}
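Before cropping, it can help to confirm that every video has a landmark file with the matching name. The following is a minimal sketch, not part of the repository: the function name find_missing_landmarks is hypothetical, and it assumes that videos are .mp4 files and that the landmarks directory mirrors the directory layout of the videos.

```python
from pathlib import Path

# Assumption: videos are stored as .mp4 files; change this for your data.
VIDEO_SUFFIX = ".mp4"


def find_missing_landmarks(src_dir, landmarks_dir, video_suffix=VIDEO_SUFFIX):
    """Return relative paths of videos in src_dir that lack a .npy landmark file.

    Hypothetical helper: assumes landmarks_dir mirrors the layout of src_dir.
    """
    src_dir, landmarks_dir = Path(src_dir), Path(landmarks_dir)
    missing = []
    for video in sorted(src_dir.rglob(f"*{video_suffix}")):
        rel = video.relative_to(src_dir)
        # Same name as the video, except that it ends in .npy.
        landmark = (landmarks_dir / rel).with_suffix(".npy")
        if not landmark.is_file():
            missing.append(rel)
    return missing
```

Calling find_missing_landmarks with the same source and landmarks directories that are passed to the cropping command returns an empty list when every video is covered.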
    

RAVEn pre-trained models

Below are the checkpoints of the Base and Large models pre-trained with RAVEn on LRS3+Vox2-en.

Model | Modality | Checkpoint
Base | Video | Download
Base | Audio | Download
Large | Video | Download
Large | Audio | Download

BRAVEn pre-trained models

Below are the checkpoints of the Base, Base+, and Large models pre-trained with BRAVEn.

Model | Modality | Checkpoint
Base (LRS3) | Video | Download
Base (LRS3) | Audio | Download
Base+ (LRS3+Vox2) | Video | Download
Base+ (LRS3+Vox2) | Audio | Download
Large (LRS3+Vox2+AVS) | Video | Download
Large (LRS3+Vox2+AVS) | Audio | Download

Testing

  • Below are the checkpoints corresponding to Tables 1 and 2 for VSR and ASR on LRS3. Models are provided for both low- and high-resource labelled data settings. In the high-resource setting, the models are fine-tuned on the full LRS3 dataset (433 hours). In the low-resource setting, they are fine-tuned on a subset ("trainval") of LRS3 (30 hours).

  • In some cases, the models were re-trained, so their WERs may differ slightly from those reported in the papers; the tables below list the reported values.

  • The paths of the Slurm bash scripts used for inference are shown in the tables below. Note that the scripts may need to be modified to suit your cluster environment.

  • The language model we used in this work can be found here.
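For reference, the WER (word error rate) values in the tables below are percentages. The sketch below shows the standard definition, the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, aggregated over a set of utterances. It is an illustration only and not necessarily the scoring code used by the inference scripts; the function names are hypothetical, and it assumes transcripts are already normalized and whitespace-tokenized.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def wer_percent(pairs):
    """Corpus-level WER in percent for (reference, hypothesis) string pairs."""
    errors = words = 0
    for reference, hypothesis in pairs:
        ref, hyp = reference.split(), hypothesis.split()
        errors += edit_distance(ref, hyp)
        words += len(ref)
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words
```

For example, a reference "the cat sat on the mat" with hypothesis "the cat sit on mat" contains one substitution and one deletion, which gives 2 errors over 6 reference words.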

VSR

RAVEn low-resource

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script
Base | LRS3 | 47.0 | Download | scripts/vsr/lrs3_trainval/base_lrs3.sh
Base | LRS3+Vox2-en | 40.2 | Download | scripts/vsr/lrs3_trainval/base_lrs3vox2.sh
Large | LRS3+Vox2-en | 32.5 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2.sh
Large w/ ST | LRS3+Vox2-en | 24.8 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2_self.sh
Large w/ ST + LM | LRS3+Vox2-en | 23.8 | same as last row | scripts/vsr/lrs3_trainval/large_lrs3vox2_self_lm.sh

BRAVEn low-resource

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script
Base | LRS3 | 43.4 | Download | scripts/vsr/lrs3_trainval/base_lrs3_braven.sh
Base Plus | LRS3+Vox2-en | 35.1 | Download | scripts/vsr/lrs3_trainval/baseplus_lrs3vox2_braven.sh
Large | LRS3+Vox2-en | 30.8 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2_braven.sh
Large | LRS3+Vox2-en+AVS | 24.8 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_braven.sh
Large w/ ST | LRS3+Vox2-en+AVS | 21.3 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_self_braven.sh
Large w/ ST + LM | LRS3+Vox2-en+AVS | 20.0 | same as last row | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_self_lm_braven.sh

RAVEn high-resource

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script
Base | LRS3 | 39.1 | Download | scripts/vsr/lrs3/base_lrs3.sh
Base | LRS3+Vox2-en | 33.1 | Download | scripts/vsr/lrs3/base_lrs3vox2.sh
Large | LRS3+Vox2-en | 27.8 | Download | scripts/vsr/lrs3/large_lrs3vox2.sh
Large w/ ST | LRS3+Vox2-en | 24.4 | Download | scripts/vsr/lrs3/large_lrs3vox2_self.sh
Large w/ ST + LM | LRS3+Vox2-en | 23.1 | same as last row | scripts/vsr/lrs3/large_lrs3vox2_self_lm.sh

BRAVEn high-resource

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script
Base | LRS3 | 36.0 | Download | scripts/vsr/lrs3/base_lrs3_braven.sh
Base Plus | LRS3+Vox2-en | 28.8 | Download | scripts/vsr/lrs3/baseplus_lrs3vox2_braven.sh
Large | LRS3+Vox2-en | 26.6 | Download | scripts/vsr/lrs3/large_lrs3vox2_braven.sh
Large | LRS3+Vox2-en+AVS | 23.6 | Download | scripts/vsr/lrs3/large_lrs3vox2avs_braven.sh
Large w/ ST | LRS3+Vox2-en+AVS | 20.9 | Download | scripts/vsr/lrs3/large_lrs3vox2avs_self_braven.sh
Large w/ ST + LM | LRS3+Vox2-en+AVS | 20.1 | same as last row | scripts/vsr/lrs3/large_lrs3vox2avs_self_lm_braven.sh

ASR

RAVEn low-resource

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script
Base | LRS3 | 4.7 | Download | scripts/asr/lrs3_trainval/base_lrs3.sh
Base | LRS3+Vox2-en | 3.8 | Download | scripts/asr/lrs3_trainval/base_lrs3vox2.sh
Large | LRS3+Vox2-en | 2.7 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2.sh
Large w/ ST | LRS3+Vox2-en | 2.3 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2_self.sh
Large w/ ST + LM | LRS3+Vox2-en | 1.9 | same as last row | scripts/asr/lrs3_trainval/large_lrs3vox2_self_lm.sh

BRAVEn low-resource

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script
Base | LRS3 | 4.0 | Download | scripts/asr/lrs3_trainval/base_lrs3_braven.sh
Base Plus | LRS3+Vox2-en | 3.0 | Download | scripts/asr/lrs3_trainval/baseplus_lrs3vox2_braven.sh
Large | LRS3+Vox2-en | 2.3 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2_braven.sh
Large | LRS3+Vox2-en+AVS | 2.1 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2avs_braven.sh
Large w/ ST | LRS3+Vox2-en+AVS | 1.9 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2avs_self_braven.sh
Large w/ ST + LM | LRS3+Vox2-en+AVS | 1.7 | same as last row | scripts/asr/lrs3_trainval/large_lrs3vox2avs_self_lm_braven.sh

RAVEn high-resource

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script
Base | LRS3 | 2.2 | Download | scripts/asr/lrs3/base_lrs3.sh
Base | LRS3+Vox2-en | 1.9 | Download | scripts/asr/lrs3/base_lrs3vox2.sh
Large | LRS3+Vox2-en | 1.4 | Download | scripts/asr/lrs3/large_lrs3vox2.sh
Large w/ ST | LRS3+Vox2-en | 1.4 | Download | scripts/asr/lrs3/large_lrs3vox2_self.sh
Large w/ ST + LM | LRS3+Vox2-en | 1.4 | same as last row | scripts/asr/lrs3/large_lrs3vox2_self_lm.sh

BRAVEn high-resource

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script
Base | LRS3 | 1.9 | Download | scripts/asr/lrs3/base_lrs3_braven.sh
Base Plus | LRS3+Vox2-en | 1.4 | Download | scripts/asr/lrs3/baseplus_lrs3vox2_braven.sh
Large | LRS3+Vox2-en | 1.2 | Download | scripts/asr/lrs3/large_lrs3vox2_braven.sh
Large | LRS3+Vox2-en+AVS | 1.2 | Download | scripts/asr/lrs3/large_lrs3vox2avs_braven.sh
Large w/ ST | LRS3+Vox2-en+AVS | 1.2 | Download | scripts/asr/lrs3/large_lrs3vox2avs_self_braven.sh
Large w/ ST + LM | LRS3+Vox2-en+AVS | 1.1 | same as last row | scripts/asr/lrs3/large_lrs3vox2avs_self_lm_braven.sh
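
The script paths listed in the VSR and ASR tables follow a regular naming pattern: task, fine-tuning setting, then model size, pre-training data, and optional suffixes. The sketch below reconstructs a path from those parts. The function inference_script and its parameter names are hypothetical and not part of the repository, and not every combination of arguments corresponds to a script that exists, so check the result against the tables.

```python
def inference_script(task, resource, size, data,
                     self_training=False, lm=False, braven=False):
    """Build an inference script path following the pattern seen in the tables.

    Hypothetical helper: it only formats a string and does not check
    that the script exists in the repository.
    """
    if task not in ("vsr", "asr"):
        raise ValueError("task must be 'vsr' or 'asr'")
    # Low-resource models are fine-tuned on "trainval", high-resource on full LRS3.
    settings = {"low": "lrs3_trainval", "high": "lrs3"}
    if resource not in settings:
        raise ValueError("resource must be 'low' or 'high'")
    if lm and not self_training:
        # In the tables, the language model only appears together with self-training.
        raise ValueError("lm=True is only listed together with self_training=True")
    name = f"{size}_{data}"
    if self_training:
        name += "_self"
    if lm:
        name += "_lm"
    if braven:
        name += "_braven"
    return f"scripts/{task}/{settings[resource]}/{name}.sh"
```

For instance, inference_script("asr", "high", "baseplus", "lrs3vox2", braven=True) gives the path listed for the BRAVEn Base Plus high-resource ASR model. The resulting script would typically be submitted with sbatch after adapting it to the cluster.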

Code for pre-training and fine-tuning coming soon...

Citation

If you find this repo useful for your research, please consider citing the following:

@article{haliassos2022jointly,
  title={Jointly Learning Visual and Auditory Speech Representations from Raw Data},
  author={Haliassos, Alexandros and Ma, Pingchuan and Mira, Rodrigo and Petridis, Stavros and Pantic, Maja},
  journal={arXiv preprint arXiv:2212.06246},
  year={2022}
}
@inproceedings{haliassos2024braven,
  title={BRAVEn: Improving Self-supervised pre-training for Visual and Auditory Speech Recognition},
  author={Haliassos, Alexandros and Zinonos, Andreas and Mira, Rodrigo and Petridis, Stavros and Pantic, Maja},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={11431--11435},
  year={2024},
  organization={IEEE}
}