Audio-Visual Speech Recognition using Sequence to Sequence Models

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

AVSR-tf1

Audio-Visual Speech Recognition (AVSR) research system using sequence-to-sequence neural networks, built on TensorFlow 1.13.

About

AVSR-tf1 is an open-source research system for Speech Recognition.

Written entirely in Python, AVSR-tf1 aims to provide a simple and reproducible way of training and evaluating speech recognition models based on sequence to sequence neural networks. AVSR-tf1 can exploit both auditory and visual speech modalities, considered either independently (ASR, VSR) or jointly (AVSR).

Rather than providing dense documentation to users and contributors, the AVSR-tf1 code is designed (or strives) to be intuitive and self-explanatory, encouraging researchers and developers to understand the entire codebase and propose improvements at its lowest levels. We therefore want it to be a flexible research system rather than a black box for production.

Core functionalities

1. Extract acoustic features from audio files (librosa, TensorFlow)

  • log mel-scale spectrograms, MFCC
  • optional computation of first and second derivatives
  • optional strided frame stacking
  • write into TensorFlow-compatible format (TFRecord dataset)
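The derivative and frame-stacking options above can be illustrated with a minimal NumPy sketch. It is not the code used by AVSR-tf1: the function names `add_deltas` and `stack_frames` are hypothetical, the derivatives are plain finite differences (speech toolkits usually use a regression window), and incomplete trailing windows are simply dropped.

```python
import numpy as np


def add_deltas(features):
    """Append first and second temporal derivatives to a [frames, dims] array.

    Assumption: plain finite differences via np.gradient, not the
    regression-based deltas found in most speech toolkits.
    """
    delta = np.gradient(features, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.concatenate([features, delta, delta2], axis=1)


def stack_frames(features, stack_size, stride):
    """Concatenate `stack_size` consecutive frames, advancing by `stride` frames.

    Trailing frames that do not fill a complete window are dropped.
    """
    num_frames, num_dims = features.shape
    starts = range(0, num_frames - stack_size + 1, stride)
    if len(starts) == 0:
        return np.empty((0, stack_size * num_dims), dtype=features.dtype)
    return np.stack([features[s:s + stack_size].reshape(-1) for s in starts])


# Toy input: 10 frames with 2 coefficients each (stand-in for log-mel or MFCC).
features = np.arange(20, dtype=np.float32).reshape(10, 2)
with_deltas = add_deltas(features)                           # 10 frames, 6 dims
stacked = stack_frames(with_deltas, stack_size=3, stride=2)  # 4 frames, 18 dims
```

Stacking with a stride larger than one shortens the input sequence, which reduces the number of encoder steps at the cost of a wider feature vector per step.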

2. Extract the lip region from video files (OpenFace - Tadas Baltrusaitis)

  • write into TensorFlow-compatible format (TFRecord dataset)
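As a rough illustration of lip-region extraction, the sketch below crops a bounding box around the mouth from a single frame, given 68 facial landmarks in the common 68-point layout (mouth points at indices 48 to 67). It does not call OpenFace and is not the procedure implemented in extract_faces.py; the function name `crop_lip_region` and the `margin` parameter are hypothetical.

```python
import numpy as np

# Mouth points in the common 68-point facial landmark layout.
MOUTH_LANDMARKS = slice(48, 68)


def crop_lip_region(frame, landmarks, margin=8):
    """Crop the mouth bounding box, padded by `margin` pixels.

    frame:     [H, W] or [H, W, C] image array
    landmarks: [68, 2] array of (x, y) pixel coordinates
    """
    mouth = landmarks[MOUTH_LANDMARKS]
    height, width = frame.shape[:2]
    x_min, y_min = np.floor(mouth.min(axis=0)).astype(int) - margin
    x_max, y_max = np.ceil(mouth.max(axis=0)).astype(int) + margin
    # Clip the box to the image borders.
    x_min, y_min = max(x_min, 0), max(y_min, 0)
    x_max, y_max = min(x_max, width - 1), min(y_max, height - 1)
    return frame[y_min:y_max + 1, x_min:x_max + 1]


# Toy example: a blank 100x120 RGB frame with synthetic mouth landmarks.
frame = np.zeros((100, 120, 3), dtype=np.uint8)
landmarks = np.zeros((68, 2))
landmarks[48:68] = [[40 + i, 60 + (i % 5)] for i in range(20)]
lips = crop_lip_region(frame, landmarks)
```

A real pipeline would typically also resize each crop to a fixed resolution, since the box size varies from frame to frame.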

3. Train sequence to sequence neural networks for continuous speech recognition

  • audio-only (LAS [3])
  • visual-only (lip-reading [5])
  • audio-visual fusion
    • dual-attention decoder (WLAS [4])
    • attention-based alignment (AV-Align [6, 7])
  • flexible language units (phonemes, visemes, characters etc.)
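The idea behind attention-based alignment of two modalities can be sketched with generic scaled dot-product attention in NumPy: each audio state queries all video states and receives a weighted summary of them. This is only an illustration of the general mechanism. The models in [6, 7] use learned recurrent encoders and their own scoring and fusion layers, none of which is reproduced here; the function name `cross_modal_attention` and the fusion by concatenation are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def cross_modal_attention(audio_states, video_states):
    """Let every audio state attend over all video states.

    audio_states: [Ta, d] array, one row per audio time step
    video_states: [Tv, d] array, one row per video time step
    Returns the fused states [Ta, 2d] and the attention weights [Ta, Tv].
    """
    d = audio_states.shape[1]
    scores = audio_states @ video_states.T / np.sqrt(d)  # [Ta, Tv]
    weights = softmax(scores, axis=1)                    # each row sums to 1
    context = weights @ video_states                     # [Ta, d]
    # Assumption: fuse by concatenating each audio state with its visual context.
    fused = np.concatenate([audio_states, context], axis=1)
    return fused, weights


rng = np.random.RandomState(0)
audio = rng.randn(5, 4)  # 5 audio steps, 4 features
video = rng.randn(3, 4)  # 3 video steps, 4 features
fused, weights = cross_modal_attention(audio, video)
```

Because the weights are computed per audio step, the two streams do not need to share a frame rate.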

4. Evaluate models

  • normalised Levenshtein distances
    • Character Error Rate
    • Word Error Rate
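The two metrics can be sketched in plain Python 3 as an edit distance divided by the reference length. The helper names below are hypothetical, and normalising by the reference length is an assumption; the evaluation code in this repository may handle normalisation and tokenisation differently.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalised by the reference length (an assumption)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def character_error_rate(ref_text, hyp_text):
    """Error rate over characters, spaces included."""
    return error_rate(list(ref_text), list(hyp_text))


def word_error_rate(ref_text, hyp_text):
    """Error rate over whitespace-separated words."""
    return error_rate(ref_text.split(), hyp_text.split())


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
cer = character_error_rate(reference, hypothesis)
wer = word_error_rate(reference, hypothesis)  # 1 substitution + 1 deletion over 6 words
```

With this normalisation an error rate can exceed 1.0 when the hypothesis is much longer than the reference.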

Getting started

A typical workflow is as follows:

  1. convert data into .tfrecord files
  2. train/evaluate models

Please refer to the attached examples for running audio-only, visual-only, or audio-visual speech recognition experiments.

To prepare the data, you can use the two scripts extract_faces.py and write_records_tcd.py.

Dependencies

For visual and audio-visual experiments, please install OpenFace, compiled from source.

The other dependencies are popular, easy-to-install Python packages, so feel free to use your preferred sources.

The supported TensorFlow version for this repository is 1.13.1, and the recommended install source is: pip install tensorflow_gpu==1.13.1.

Please get in touch in case you face any issues.

Acknowledgements

We are grateful to Eugene Brevdo of Google for his remarkable help and advice during the early stages of development. In addition, we would like to thank Derek Murray, Andreas Steiner, and Khe Chai Sim for their assistance and interesting conversations, as well as every TensorFlow contributor on GitHub and StackOverflow. Our work is supported by NVIDIA, which granted us a Titan Xp GPU through its academic program.

How to cite

If you use this work, please cite it as:

George Sterpu, Christian Saam, and Naomi Harte. How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020. https://doi.org/10.1109/TASLP.2020.2980436

[bib]

@ARTICLE{Sterpu2020,
  author={G. {Sterpu} and C. {Saam} and N. {Harte}},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition},
  year={2020},
  volume={},
  number={},
  pages={1-1},
}

or

George Sterpu, Christian Saam, and Naomi Harte. 2018. Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition. In 2018 International Conference on Multimodal Interaction (ICMI ’18), October 16–20, 2018, Boulder, CO, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3242969.3243014

[bib]

@inproceedings{sterpu_icmi18,
  author = {George Sterpu and Christian Saam and Naomi Harte},
  title = {Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition},
  year = {2018},
  publisher = {{ACM, New York, NY, USA}},
  booktitle = {2018 International Conference on Multimodal Interaction (ICMI '18), October 16--20, 2018, Boulder, CO, USA},
  url       = {http://doi.acm.org/10.1145/3242969.3243014},
  doi       = {10.1145/3242969.3243014},
}

How to contribute

We are delighted to receive your feedback and help on improving AVSR-tf1. On the technical side, this could be advice or a pull request for code refactoring (we are not Python/TensorFlow experts), implementations of popular features, bug reports, performance improvements, language models, or support for computation in 16-bit precision or on Google TPU devices.

References

[1] Sequence to Sequence Learning with Neural Networks https://arxiv.org/abs/1409.3215

[2] Neural Machine Translation by Jointly Learning to Align and Translate https://arxiv.org/abs/1409.0473

[3] Listen, Attend and Spell https://arxiv.org/abs/1508.01211

[4] Lip Reading Sentences in the Wild https://arxiv.org/abs/1611.05358

[5] Can DNNs Learn to Lipread Full Sentences? https://arxiv.org/abs/1805.11685

[6] Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition https://arxiv.org/abs/1809.01728

[7] How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition https://ieeexplore.ieee.org/document/9035650