This repository contains the PyTorch implementation of the following papers:
Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring (CVPR2023) - AVRelScore
Joanna Hong*, Minsu Kim*, Jeongsoo Choi, and Yong Man Ro (*Equal contribution)
[Paper] [Demo Video]
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition (Interspeech 2022) - VCAFE
Joanna Hong*, Minsu Kim*, and Yong Man Ro (*Equal contribution) [Paper]
- python 3.8
- pytorch 1.8 ~ 1.9
- torchvision
- torchaudio
- ffmpeg
- av
- tensorboard
- scikit-image
- opencv-python
- pillow
- librosa
- scipy
- albumentations
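For reference, a minimal environment setup sketch is given below. The package names follow the requirements list above; the environment name and the choice of conda/pip are assumptions, not something this repository prescribes, and ffmpeg is a system-level dependency (av refers to the PyAV package).

# Assumed setup sketch: environment name and installer choice are not prescribed by this repository
conda create -n avsr python=3.8 -y
conda activate avsr
conda install -c conda-forge ffmpeg -y            # ffmpeg is a system-level dependency
pip install "torch>=1.8,<1.10" torchvision torchaudio   # matches the pytorch 1.8 ~ 1.9 requirement
pip install av tensorboard scikit-image opencv-python pillow librosa scipy albumentations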
The LRS2 and LRS3 datasets can be downloaded from the links below.
For data preprocessing, download the facial landmarks of LRS2 and LRS3 from the repository (the landmarks provided for the "VSR for multiple languages" models).
For visual corruption modeling, download coco_object.7z from the repository. Unzip it and put the extracted files at
./occlusion_patch/object_image_sr
./occlusion_patch/object_mask_x4
For audio corruption modeling, download the babble noise file from here.
Put the file at
./src/data/babbleNoise_resample_16K.npy
For initializing the visual and audio frontends, please download the pre-trained models (resnet18_dctcn_audio / resnet18_dctcn_video) from the repository.
Put the .tar files at
./checkpoints/frontend/lrw_resnet18_dctcn_audio.pth.tar
./checkpoints/frontend/lrw_resnet18_dctcn_video.pth.tar
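As a reference, the downloaded assets can be placed as in the sketch below. The destination paths are the ones listed above; the source file names and the archive layout are assumptions and may differ from what you actually download.

# Assumed placement sketch: destination paths are from this README; source file names may differ
mkdir -p ./occlusion_patch ./src/data ./checkpoints/frontend
7z x coco_object.7z -o./occlusion_patch   # assuming the archive contains object_image_sr/ and object_mask_x4/
mv babbleNoise_resample_16K.npy ./src/data/
mv lrw_resnet18_dctcn_audio.pth.tar lrw_resnet18_dctcn_video.pth.tar ./checkpoints/frontend/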
After downloading the dataset and landmarks, we 1) align and crop the lip-centered video, 2) extract the audio, and 3) obtain the aligned landmarks. We assume the data directory is structured as follows:
LRS2
├── main
| ├── *
| | ├── *.mp4
| | └── *.txt
├── pretrain
| ├── *
| | ├── *.mp4
| | └── *.txt
LRS3
├── trainval
| ├── *
| | ├── *.mp4
| | └── *.txt
├── pretrain
| ├── *
| | ├── *.mp4
| | └── *.txt
├── test
| ├── *
| | ├── *.mp4
| | └── *.txt
Run preprocessing with the following commands:
# For LRS2
python preprocessing.py \
--data_path '/path_to/LRS2' \
--data_type LRS2 \
--landmark_path '/path_to/LRS2_landmarks' \
--save_path '/path_to/LRS2_processed'
# For LRS3
python preprocessing.py \
--data_path '/path_to/LRS3' \
--data_type LRS3 \
--landmark_path '/path_to/LRS3_landmarks' \
--save_path '/path_to/LRS3_processed'
You can choose the model architecture with the parameter architecture. There are three options for architecture: AVRelScore, VCAFE, and Conformer.
To train the model, run the following command:
# AVRelScore: Distributed training example using 2 GPUs on LRS2 (nproc_per_node should match the number of GPUs)
python -m torch.distributed.launch --nproc_per_node=2 \
train.py \
--data_path '/path_to/LRS2_processed' \
--data_type LRS2 \
--split_file ./src/data/LRS2/0_600.txt \
--model_conf ./src/models/model.json \
--checkpoint_dir 'enter_the_path_to_save' \
--v_frontend_checkpoint ./checkpoints/frontend/lrw_resnet18_dctcn_video.pth.tar \
--a_frontend_checkpoint ./checkpoints/frontend/lrw_resnet18_dctcn_audio.pth.tar \
--wandb_project 'wandb_project_name' \
--batch_size 4 \
--update_frequency 1 \
--epochs 200 \
--eval_step 5000 \
--visual_corruption \
--architecture AVRelScore \
--distributed \
--gpu 0,1
# VCAFE: 1 GPU training example on LRS3
python train.py \
--data_path '/path_to/LRS3_processed' \
--data_type LRS3 \
--split_file ./src/data/LRS3/0_600.txt \
--model_conf ./src/models/model.json \
--checkpoint_dir 'enter_the_path_to_save' \
--v_frontend_checkpoint ./checkpoints/frontend/lrw_resnet18_dctcn_video.pth.tar \
--a_frontend_checkpoint ./checkpoints/frontend/lrw_resnet18_dctcn_audio.pth.tar \
--wandb_project 'wandb_project_name' \
--batch_size 4 \
--update_frequency 1 \
--epochs 200 \
--eval_step 5000 \
--visual_corruption \
--architecture VCAFE \
--gpu 0
Descriptions of training parameters are as follows:
- --data_path: preprocessed dataset location (LRS2 or LRS3)
- --data_type: choose to train on LRS2 or LRS3
- --split_file: train and validation file lists (you can do curriculum learning by changing the split_file; 0_100.txt contains files with 0 to 100 frames, and training directly on 0_600.txt also works reasonably well)
- --checkpoint_dir: directory for saving checkpoints
- --checkpoint: saved checkpoint to resume training from (see the resume sketch after this list)
- --model_conf: model configuration
- --wandb_project: if you want to use wandb, set the project name here
- --batch_size: batch size
- --update_frequency: if you use a small batch_size, increase update_frequency; the effective training batch size is batch_size * update_frequency
- --epochs: number of epochs
- --tot_iters: if set, training finishes at the given total number of iterations
- --eval_step: evaluation is performed every eval_step iterations
- --fast_validate: if set, validation is performed on a subset of the validation data
- --visual_corruption: if set, visual corruption modeling is applied during training
- --architecture: choose which architecture to train (options: AVRelScore, VCAFE, Conformer)
- --gpu: GPU number(s) for training
- --distributed: if set, distributed training is performed
- Refer to train.py for the other training parameters
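For example, resuming an interrupted run only requires adding --checkpoint. The sketch below is an assumption-level example that mirrors the LRS2 settings above; the checkpoint path is a placeholder, and batch_size 2 with update_frequency 2 is used to illustrate keeping the effective batch size at 4.

# Assumed resume sketch: mirrors the LRS2 example above; the checkpoint path is a placeholder,
# and batch_size 2 with update_frequency 2 keeps the effective batch size at 4
python train.py \
--data_path '/path_to/LRS2_processed' \
--data_type LRS2 \
--split_file ./src/data/LRS2/0_600.txt \
--model_conf ./src/models/model.json \
--checkpoint_dir 'enter_the_path_to_save' \
--checkpoint 'enter_the_checkpoint_path' \
--v_frontend_checkpoint ./checkpoints/frontend/lrw_resnet18_dctcn_video.pth.tar \
--a_frontend_checkpoint ./checkpoints/frontend/lrw_resnet18_dctcn_audio.pth.tar \
--batch_size 2 \
--update_frequency 2 \
--epochs 200 \
--eval_step 5000 \
--visual_corruption \
--architecture AVRelScore \
--gpu 0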
To check the training progress, run:
tensorboard --logdir='./runs/logs to watch' --host='ip address of the server'
The tensorboard shows the training and validation losses and evaluation metrics.
Also, if you set wandb_project, you can check the wandb logs.
To test the model, run the following command:
# AVRelScore: test example on LRS2
python test.py \
--data_path '/path_to/LRS2_processed' \
--data_type LRS2 \
--model_conf ./src/models/model.json \
--split_file ./src/data/LRS2/test.ref \
--checkpoint 'enter_the_checkpoint_path' \
--architecture AVRelScore \
--results_path './test_results.txt' \
--rnnlm ./checkpoints/LM/model.pth \
--rnnlm_conf ./checkpoints/LM/model.json \
--beam_size 40 \
--ctc_weight 0.1 \
--lm_weight 0.5 \
--gpu 0
Descriptions of testing parameters are as follows:
- --data_path: preprocessed dataset location (LRS2 or LRS3)
- --data_type: choose to evaluate on LRS2 or LRS3
- --split_file: set to test.ref (./src/data/LRS2/test.ref or ./src/data/LRS3/test.ref)
- --checkpoint: model checkpoint to test
- --model_conf: model configuration
- --architecture: choose which architecture to evaluate (options: AVRelScore, VCAFE, Conformer)
- --gpu: GPU number for testing
- --rnnlm: language model checkpoint
- --rnnlm_conf: language model configuration
- --beam_size: beam size
- --ctc_weight: CTC weight for joint decoding
- --lm_weight: language model weight for decoding
- Refer to test.py for the other parameters (an additional test sketch on LRS3 follows this list)
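As an additional usage sketch, the same command can evaluate a VCAFE checkpoint on LRS3 by switching the dataset-specific arguments; the checkpoint path is a placeholder, and the decoding weights follow the values used for the released models.

# VCAFE: test example sketch on LRS3 (checkpoint path is a placeholder)
python test.py \
--data_path '/path_to/LRS3_processed' \
--data_type LRS3 \
--model_conf ./src/models/model.json \
--split_file ./src/data/LRS3/test.ref \
--checkpoint 'enter_the_checkpoint_path' \
--architecture VCAFE \
--results_path './test_results.txt' \
--rnnlm ./checkpoints/LM/model.pth \
--rnnlm_conf ./checkpoints/LM/model.json \
--beam_size 40 \
--ctc_weight 0.1 \
--lm_weight 0.5 \
--gpu 0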
We release the pre-trained AVSR models (VCAFE and AVRelScore) on the LRS2 and LRS3 databases. (The WERs below are obtained with a beam size of 40, ctc_weight of 0.1, and lm_weight of 0.5.)
Model | Dataset | WER (%) |
---|---|---|
VCAFE | LRS2 | 4.459 |
VCAFE | LRS3 | 2.821 |
AVRelScore | LRS2 | 4.129 |
AVRelScore | LRS3 | 2.770 |
You can find the pre-trained language model in the following repository. Put the language model files at
./checkpoints/LM/model.pth
./checkpoints/LM/model.json
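A small placement sketch, assuming the downloaded files keep the names model.pth and model.json:

# Assumed placement sketch for the language model files
mkdir -p ./checkpoints/LM
mv model.pth model.json ./checkpoints/LM/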
Please refer to the following repository for making the audio-visual corrupted dataset.
The code is based on the following two repositories, ESPnet and VSR for Multiple Languages.
If you find this work useful in your research, please cite the papers:
@inproceedings{hong2023watch,
title={Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring},
author={Hong, Joanna and Kim, Minsu and Choi, Jeongsoo and Ro, Yong Man},
booktitle={Proc. CVPR},
pages={18783--18794},
year={2023}
}
@inproceedings{hong2022visual,
title={Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition},
author={Hong, Joanna and Kim, Minsu and Ro, Yong Man},
booktitle={Proc. Interspeech},
pages={2838--2842},
year={2022},
organization={ISCA}
}