audio-visual-speech-recognition
There are 18 repositories under the audio-visual-speech-recognition topic.
modelscope/FunASR
A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, supporting speech recognition, voice activity detection, text post-processing, etc.
smeetrs/deep_avsr
A PyTorch implementation of the Deep Audio-Visual Speech Recognition paper.
ankurbhatia24/MULTIMODAL-EMOTION-RECOGNITION
Human Emotion Understanding using multimodal dataset.
umbertocappellazzo/Llama-AVSR
[ICASSP 2025] Official PyTorch implementation of "Large Language Models are Strong Audio-Visual Speech Recognition Learners".
georgesterpu/Taris
Transformer-based online speech recognition system with TensorFlow 2
Sreyan88/LipGER
Code for InterSpeech 2024 Paper: LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition
david-gimeno/tailored-avsr
Official source code for the paper "Tailored Design of Audio-Visual Speech Recognition Models using Branchformers"
sungnyun/avsr-temporal-dynamics
(SLT 2024) Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition
aidayang/FunASR-OneClick
One-click real-time speech recognition build of FunASR; transcribes both microphone input and audio played on the computer, usable as voice-typing software for the PC.
sungnyun/cav2vec
(ICLR 2025) Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation
lzuwei/end-to-end-multiview-lipreading
End to End Multiview Lip Reading
hmeutzner/kaldi-avsr
Kaldi-based audio-visual speech recognition
karlsimsBBC/cassette-bot
🤖 📼 Command-line tool for remixing videos with time-coded transcriptions.
zulfiqar-ali01/audio-visual-Transcription
Real-Time Audio-Visual Speech Recognition
luomingshuang/lipreading_with_icefall
An experiment using k2, icefall, and Lhotse for lip reading, with plans to adapt the codebase to the lip-reading task and to add support for multiple lip-reading datasets.
Remi-Gau/McGurk_prior_code
Code related to the fMRI experiment on the contextual modulation of the McGurk Effect
MaazKhan98/Multimodal-Emotion-Recognition-speech-facial-and-body-gestures
Human Emotion Understanding using multimodal dataset
tudorhirtopanu/av-matchmaker
Multi-speaker diarization from video using SyncNet’s cross-modal embedding space to match multiple face tracks to corresponding audio tracks.