(B)RAVEn: A PyTorch Lightning Implementation

Introduction

We provide code for reproducing the main results of "Jointly Learning Visual and Auditory Speech Representations from Raw Data" and "BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition". Our implementation is based on PyTorch Lightning.

Preparation

Installation

Run conda env create -f environment.yml. If necessary, change the environment prefix to match the location of your miniconda3 installation.

Data

The datasets used in the paper can be downloaded from the following links:
- LRS3
- VoxCeleb2
- LRS2

Compute 68 landmarks per frame using, e.g., RetinaFace and 2-D FAN, or download them, e.g., from this repo. Each landmark file should have the same name as its corresponding video, except that it ends in .npy.
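
Before cropping, it can help to confirm that every video has a matching landmark file. The following is a minimal sketch, not part of this repo: the function name `find_missing_landmarks` is hypothetical, and it assumes that videos end in .mp4 and that the landmarks directory mirrors the folder layout of the video directory. Adjust both assumptions to your setup.

```python
from pathlib import Path


def find_missing_landmarks(video_dir, landmarks_dir, video_ext=".mp4"):
    """Return relative paths of videos that have no matching .npy landmark file.

    Assumption (not confirmed by the repo): landmarks_dir mirrors the
    sub-folder structure of video_dir, and videos use the extension video_ext.
    """
    video_dir, landmarks_dir = Path(video_dir), Path(landmarks_dir)
    missing = []
    for video in sorted(video_dir.rglob(f"*{video_ext}")):
        rel = video.relative_to(video_dir)
        # Same name as the video, except that it ends in .npy
        expected = landmarks_dir / rel.with_suffix(".npy")
        if not expected.is_file():
            missing.append(rel.as_posix())
    return missing
```

An empty list means every video found under the video directory has a landmark file with the expected name.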

Use the following command to crop the mouths:

python preprocessing/extract_mouths.py --src_dir ${SOURCE_DIR} --tgt_dir ${TARGET_DIR} --landmarks_dir ${LANDMARKS_DIR}

RAVEn pre-trained models

Below are the checkpoints of the Base and Large models pre-trained with RAVEn on LRS3+Vox2-en.

Model	Modality	Checkpoint
Base	Video	Download
Base	Audio	Download
Large	Video	Download
Large	Audio	Download

BRAVEn pre-trained models

Below are the checkpoints of the Base, Base+, and Large models pre-trained with BRAVEn.

Model	Modality	Checkpoint
Base (LRS3)	Video	Download
Base (LRS3)	Audio	Download
Base+ (LRS3+Vox2)	Video	Download
Base+ (LRS3+Vox2)	Audio	Download
Large (LRS3+Vox2+AVS)	Video	Download
Large (LRS3+Vox2+AVS)	Audio	Download

Testing

Below are the checkpoints corresponding to Tables 1 and 2 for VSR and ASR on LRS3. Models are provided for both low- and high-resource labelled data settings. In the high-resource setting, the models are fine-tuned on the full LRS3 dataset (433 hours). In the low-resource setting, they are fine-tuned on a subset ("trainval") of LRS3 (30 hours).
In some cases, the models were re-trained, so the WERs may differ slightly from those shown in the paper (which are also reproduced below).
The paths of the Slurm bash scripts used for inference are shown in the tables below. Note that the scripts may need to be modified to suit your cluster environment.
The language model we used in this work can be found here.
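
For readers unfamiliar with the metric in the tables below, word error rate (WER) is the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name `word_error_rate` is hypothetical, whitespace tokenization is an assumption, and this is not necessarily the exact scoring code used by the inference scripts.

```python
def word_error_rate(reference, hypothesis):
    """Return the WER (in %) between a reference and a hypothesis transcript.

    Assumption: words are separated by whitespace and compared exactly
    (no text normalization is applied here).
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Test-set WER is typically obtained by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.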

VSR

RAVEn low-resource

Model	Pre-training dataset	WER (%)	Checkpoint	Bash script
Base	LRS3	47.0	Download	scripts/vsr/lrs3_trainval/base_lrs3.sh
Base	LRS3+Vox2-en	40.2	Download	scripts/vsr/lrs3_trainval/base_lrs3vox2.sh
Large	LRS3+Vox2-en	32.5	Download	scripts/vsr/lrs3_trainval/large_lrs3vox2.sh
Large w/ ST	LRS3+Vox2-en	24.8	Download	scripts/vsr/lrs3_trainval/large_lrs3vox2_self.sh
Large w/ ST + LM	LRS3+Vox2-en	23.8	same as last row	scripts/vsr/lrs3_trainval/large_lrs3vox2_self_lm.sh

BRAVEn low-resource

Model	Pre-training dataset	WER (%)	Checkpoint	Bash script
Base	LRS3	43.4	Download	scripts/vsr/lrs3_trainval/base_lrs3_braven.sh
Base Plus	LRS3+Vox2-en	35.1	Download	scripts/vsr/lrs3_trainval/baseplus_lrs3vox2_braven.sh
Large	LRS3+Vox2-en	30.8	Download	scripts/vsr/lrs3_trainval/large_lrs3vox2_braven.sh
Large	LRS3+Vox2-en+AVS	24.8	Download	scripts/vsr/lrs3_trainval/large_lrs3vox2avs_braven.sh
Large w/ ST	LRS3+Vox2-en+AVS	21.3	Download	scripts/vsr/lrs3_trainval/large_lrs3vox2avs_self_braven.sh
Large w/ ST + LM	LRS3+Vox2-en+AVS	20.0	same as last row	scripts/vsr/lrs3_trainval/large_lrs3vox2avs_self_lm_braven.sh

RAVEn high-resource

Model	Pre-training dataset	WER (%)	Checkpoint	Bash script
Base	LRS3	39.1	Download	scripts/vsr/lrs3/base_lrs3.sh
Base	LRS3+Vox2-en	33.1	Download	scripts/vsr/lrs3/base_lrs3vox2.sh
Large	LRS3+Vox2-en	27.8	Download	scripts/vsr/lrs3/large_lrs3vox2.sh
Large w/ ST	LRS3+Vox2-en	24.4	Download	scripts/vsr/lrs3/large_lrs3vox2_self.sh
Large w/ ST + LM	LRS3+Vox2-en	23.1	same as last row	scripts/vsr/lrs3/large_lrs3vox2_self_lm.sh

BRAVEn high-resource

Model	Pre-training dataset	WER (%)	Checkpoint	Bash script
Base	LRS3	36.0	Download	scripts/vsr/lrs3/base_lrs3_braven.sh
Base Plus	LRS3+Vox2-en	28.8	Download	scripts/vsr/lrs3/baseplus_lrs3vox2_braven.sh
Large	LRS3+Vox2-en	26.6	Download	scripts/vsr/lrs3/large_lrs3vox2_braven.sh
Large	LRS3+Vox2-en+AVS	23.6	Download	scripts/vsr/lrs3/large_lrs3vox2avs_braven.sh
Large w/ ST	LRS3+Vox2-en+AVS	20.9	Download	scripts/vsr/lrs3/large_lrs3vox2avs_self_braven.sh
Large w/ ST + LM	LRS3+Vox2-en+AVS	20.1	same as last row	scripts/vsr/lrs3/large_lrs3vox2avs_self_lm_braven.sh

ASR

RAVEn low-resource

Model	Pre-training dataset	WER (%)	Checkpoint	Bash script
Base	LRS3	4.7	Download	scripts/asr/lrs3_trainval/base_lrs3.sh
Base	LRS3+Vox2-en	3.8	Download	scripts/asr/lrs3_trainval/base_lrs3vox2.sh
Large	LRS3+Vox2-en	2.7	Download	scripts/asr/lrs3_trainval/large_lrs3vox2.sh
Large w/ ST	LRS3+Vox2-en	2.3	Download	scripts/asr/lrs3_trainval/large_lrs3vox2_self.sh
Large w/ ST + LM	LRS3+Vox2-en	1.9	same as last row	scripts/asr/lrs3_trainval/large_lrs3vox2_self_lm.sh

BRAVEn low-resource

Model	Pre-training dataset	WER (%)	Checkpoint	Bash script
Base	LRS3	4.0	Download	scripts/asr/lrs3_trainval/base_lrs3_braven.sh
Base Plus	LRS3+Vox2-en	3.0	Download	scripts/asr/lrs3_trainval/baseplus_lrs3vox2_braven.sh
Large	LRS3+Vox2-en	2.3	Download	scripts/asr/lrs3_trainval/large_lrs3vox2_braven.sh
Large	LRS3+Vox2-en+AVS	2.1	Download	scripts/asr/lrs3_trainval/large_lrs3vox2avs_braven.sh
Large w/ ST	LRS3+Vox2-en+AVS	1.9	Download	scripts/asr/lrs3_trainval/large_lrs3vox2avs_self_braven.sh
Large w/ ST + LM	LRS3+Vox2-en+AVS	1.7	same as last row	scripts/asr/lrs3_trainval/large_lrs3vox2avs_self_lm_braven.sh

RAVEn high-resource

Model	Pre-training dataset	WER (%)	Checkpoint	Bash script
Base	LRS3	2.2	Download	scripts/asr/lrs3/base_lrs3.sh
Base	LRS3+Vox2-en	1.9	Download	scripts/asr/lrs3/base_lrs3vox2.sh
Large	LRS3+Vox2-en	1.4	Download	scripts/asr/lrs3/large_lrs3vox2.sh
Large w/ ST	LRS3+Vox2-en	1.4	Download	scripts/asr/lrs3/large_lrs3vox2_self.sh
Large w/ ST + LM	LRS3+Vox2-en	1.4	same as last row	scripts/asr/lrs3/large_lrs3vox2_self_lm.sh

BRAVEn high-resource

Model	Pre-training dataset	WER (%)	Checkpoint	Bash script
Base	LRS3	1.9	Download	scripts/asr/lrs3/base_lrs3_braven.sh
Base Plus	LRS3+Vox2-en	1.4	Download	scripts/asr/lrs3/baseplus_lrs3vox2_braven.sh
Large	LRS3+Vox2-en	1.2	Download	scripts/asr/lrs3/large_lrs3vox2_braven.sh
Large	LRS3+Vox2-en+AVS	1.2	Download	scripts/asr/lrs3/large_lrs3vox2avs_braven.sh
Large w/ ST	LRS3+Vox2-en+AVS	1.2	Download	scripts/asr/lrs3/large_lrs3vox2avs_self_braven.sh
Large w/ ST + LM	LRS3+Vox2-en+AVS	1.1	same as last row	scripts/asr/lrs3/large_lrs3vox2avs_self_lm_braven.sh

Code for pre-training and fine-tuning coming soon...

Citation

If you find this repo useful for your research, please consider citing the following:

@article{haliassos2022jointly,
  title={Jointly Learning Visual and Auditory Speech Representations from Raw Data},
  author={Haliassos, Alexandros and Ma, Pingchuan and Mira, Rodrigo and Petridis, Stavros and Pantic, Maja},
  journal={arXiv preprint arXiv:2212.06246},
  year={2022}
}

@inproceedings{haliassos2024braven,
  title={BRAVEn: Improving Self-supervised pre-training for Visual and Auditory Speech Recognition},
  author={Haliassos, Alexandros and Zinonos, Andreas and Mira, Rodrigo and Petridis, Stavros and Pantic, Maja},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={11431--11435},
  year={2024},
  organization={IEEE}
}
