This is the repository of Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels, which is the successor of End-to-End Audio-Visual Speech Recognition with Conformers. This repository contains both training code and pre-trained models for end-to-end audio-only and visual-only speech recognition (lipreading). Additionally, we offer a tutorial that will walk you through the process of training an ASR/VSR model using your own datasets.
You can check out our gradio demo below to run inference on your video (English) with our audio-only, visual-only, and audio-visual speech recognition models.
- Clone the repository and navigate to it:

```shell
git clone https://github.com/mpc001/auto_avsr
cd auto_avsr
```

- Set up the environment:

```shell
conda create -y -n autoavsr python=3.8
conda activate autoavsr
```

- To install the necessary packages, please follow the steps below:

Step 3.1. Install pytorch, torchvision, and torchaudio by following the instructions here.

Step 3.2. Install fairseq:

```shell
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
```

Step 3.3. Install ffmpeg by running the following command:

```shell
conda install -c conda-forge ffmpeg
```

Step 3.4. Install additional packages by running the following command:

```shell
pip install -r requirements.txt
```
- Prepare the dataset. See the instructions in the preparation folder.
To log the training process, we use wandb. To customise the yaml file, match the file name to the team name in your wandb account, e.g. cassini.yaml. Then change the `logger` argument in conf/config.yaml. Lastly, don't forget to specify the `project` argument in conf/logger/cassini.yaml. If you do not use wandb, please append `log_wandb=False` to the command.
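As a sketch, the logger yaml can be as small as the one `project` entry mentioned above (the project value here is illustrative):

```yaml
# conf/logger/cassini.yaml -- illustrative; set `project` to your wandb project name
project: auto_avsr_experiments
```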
By default, we use `data/dataset=lrs3`, which corresponds to lrs3.yaml in the configuration folder. To set up experiments, please fill in the `root` argument in the yaml file.
To fine-tune an ASR/VSR model from a pre-trained model, for instance LRW, you can run the command below. Note that the arguments `ckpt_path=[ckpt_path] transfer_frontend=True` load the weights of the pre-trained front-end component only.
python main.py exp_dir=[exp_dir] \
exp_name=[exp_name] \
data.modality=[modality] \
ckpt_path=[ckpt_path] \
transfer_frontend=True \
optimizer.lr=[lr] \
trainer.num_nodes=[num_nodes]
- `exp_dir` and `exp_name`: The directory where the checkpoints will be saved: `[exp_dir]/[exp_name]`.
- `data.modality`: The input modality. Valid values: `video`, `audio`, and `audiovisual`.
- `ckpt_path`: The absolute path to the pre-trained checkpoint file.
- `transfer_frontend`: Loads only the front-end module of `[ckpt_path]` for fine-tuning.
- `optimizer.lr`: The learning rate used. Default: 1e-3.
- `trainer.num_nodes`: The number of machines used. Default: 1.
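For example, a single-machine fine-tuning run might look like this (all bracketed values are placeholders you must fill in):

```shell
python main.py exp_dir=./exp \
               exp_name=vsr_finetune \
               data.modality=video \
               ckpt_path=[lrw_pretrained_ckpt] \
               transfer_frontend=True \
               optimizer.lr=1e-3 \
               trainer.num_nodes=1
```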
Note: The models below were trained using 4 machines (32 GPUs), except for the models trained with VoxCeleb2 and/or AVSpeech, which used 8 machines (64 GPUs). Additionally, for the model pre-trained on LRW, we used the front-end module [VSR accuracy: 89.6%; ASR accuracy: 99.1%] from the LRW model zoo for initialisation.
[Stage 1] Train the model using a 23-hour subset of LRS3 that includes only short utterances lasting no more than 4 seconds (100 frames). We set `optimizer.lr` to 0.0002 at the first stage.
python main.py exp_dir=[exp_dir] \
exp_name=[exp_name] \
data.modality=[modality] \
data.dataset.train_file=[train_file] \
optimizer.lr=[lr] \
trainer.num_nodes=[num_nodes]
[Stage 2] Use the best checkpoint from stage 1 to initialise the model and train the model with the full LRS3 dataset.
python main.py exp_dir=[exp_dir] \
exp_name=[exp_name] \
data.modality=[modality] \
data.dataset.train_file=[train_file] \
optimizer.lr=[lr] \
trainer.num_nodes=[num_nodes] \
ckpt_path=[ckpt_path]
- `data.dataset.train_file`: The training set list. Default: `lrs3_train_transcript_lengths_seg24s.csv`, which contains utterances lasting no more than 24 seconds.
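As an illustration, the two stages above might be chained as follows; bracketed values are placeholders, and the stage-1 train-file name in particular is whichever short-utterance list your data preparation produced:

```shell
# Stage 1: short utterances only, lr = 0.0002
python main.py exp_dir=./exp exp_name=vsr_stage1 data.modality=video \
               data.dataset.train_file=[short_utterance_train_file] \
               optimizer.lr=0.0002 trainer.num_nodes=1

# Stage 2: full LRS3, initialised from the best stage-1 checkpoint
python main.py exp_dir=./exp exp_name=vsr_stage2 data.modality=video \
               data.dataset.train_file=lrs3_train_transcript_lengths_seg24s.csv \
               optimizer.lr=1e-3 trainer.num_nodes=1 \
               ckpt_path=[best_stage1_ckpt]
```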
python main.py exp_dir=[exp_dir] \
exp_name=[exp_name] \
data.modality=[modality] \
ckpt_path=[ckpt_path] \
trainer.num_nodes=1 \
train=False
-
ckpt_path
: The absolute path of the ensembled checkpoint file. In this case,ckpt_path
is always set the file[exp_dir]/[exp_name]/model_avg_10.pth
. Default:null
. -
decode.snr_target={snr}
can be appended to the command line if you want to test your model in a noisy environment, wheresnr
is the signal-to-noise level. Default:999999
. -
data.dataset.test_file={test_file}
can be appeneded to the command line if you want to test models on other datasets, wheretest_file
is the testing set list. Default:lrs3_test_transcript_lengths_seg24s.csv
.
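To make the `snr` option concrete: testing at a target SNR means the noise is scaled so that the speech-to-noise power ratio equals the target in dB. The sketch below illustrates that idea only; it is not the repository's actual noise-injection code.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db` dB,
    then return the noisy mixture (illustrative sketch)."""
    p_speech = sum(s * s for s in speech) / len(speech)   # mean speech power
    p_noise = sum(n * n for n in noise) / len(noise)      # mean noise power
    # Target noise power is P_speech / 10^(snr_db / 10)
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]
```

A very large target such as the default `999999` scales the noise to (effectively) zero, i.e. a clean test.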
python infer.py data.modality=[modality] \
ckpt_path=[ckpt_path] \
trainer.num_nodes=1 \
infer_path=[infer_path]
- `ckpt_path`: The absolute path of the ensembled checkpoint file. In this case, `ckpt_path` is always set to the file `[exp_dir]/[exp_name]/model_avg_10.pth`. Default: `null`.
- `infer_path`: The absolute path to the file you'd like to transcribe.
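For example, transcribing a single video with a visual-only model (bracketed values are placeholders):

```shell
python infer.py data.modality=video \
                ckpt_path=[exp_dir]/[exp_name]/model_avg_10.pth \
                trainer.num_nodes=1 \
                infer_path=[path_to_your_video]
```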
We provide instructions that will guide you through the process of training an ASR/VSR model on other datasets using our scripts.
Lip Reading Sentences 3 (LRS3) from the Visual_Speech_Recognition_for_Multiple_Languages GitHub repository
| Components | WER | URL | Size (MB) |
|---|---|---|---|
| Visual-only | | | |
| - | 19.1 | GoogleDrive or BaiduDrive (key: dqsy) | 891 |
| Audio-only | | | |
| - | 1.0 | GoogleDrive or BaiduDrive (key: dvf2) | 860 |
| Audio-visual | | | |
| - | 0.9 | GoogleDrive or BaiduDrive (key: sai5) | 1540 |
| Language models | | | |
| - | - | GoogleDrive or BaiduDrive (key: t9ep) | 191 |
| Landmarks | | | |
| - | - | GoogleDrive or BaiduDrive (key: mi3c) | 18577 |
The table below reports WER on the LRS3 test set.
| Total Training Data | Hours‡ | WER | URL | Params (M) |
|---|---|---|---|---|
| Visual-only | | | | |
| LRS3 | 438 | 36.6 | GoogleDrive / BaiduDrive (key: xv9r) | 250 |
| LRS2+LRS3 | 661 | 32.7 | GoogleDrive / BaiduDrive (key: 4uew) | 250 |
| LRS3+VOX2 | 1759 | 25.1 | GoogleDrive / BaiduDrive (key: vgh8) | 250 |
| LRW+LRS2+LRS3+VOX2+AVSP | 3448 | 19.1 | GoogleDrive / BaiduDrive (key: dqsy) | 250 |
| Audio-only | | | | |
| LRS3 | 438 | 2.0 | GoogleDrive / BaiduDrive (key: 2x2a) | 243 |
| LRS2+LRS3 | 661 | 1.7 | GoogleDrive / BaiduDrive (key: s1ra) | 243 |
| LRW+LRS2+LRS3 | 818 | 1.6 | GoogleDrive / BaiduDrive (key: 9i2w) | 243 |
| LRS3+VOX2 | 1759 | 1.1 | GoogleDrive / BaiduDrive (key: x6wu) | 243 |
| LRW+LRS2+LRS3+VOX2+AVSP | 3448 | 1.0 | GoogleDrive / BaiduDrive (key: dvf2) | 243 |
‡The total hours are counted by including the datasets used for both pre-training and training.
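The WER figures above are word error rates: the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the number of reference words. A minimal sketch of the standard computation:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words divided by
    the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining ref words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hyp words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```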
@inproceedings{ma2023auto,
author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels},
year={2023},
pages={1-5},
doi={10.1109/ICASSP49357.2023.10096889}
}
Note that the code may only be used for comparative or benchmarking purposes, and only for non-commercial purposes under the License.
[Pingchuan Ma](pingchuan.ma16[at]imperial.ac.uk)