A curated list of awesome papers on contextualization of end-to-end (E2E) ASR.
The purpose of contextualizing ASR output is to bias recognition towards tokens (typically proper nouns, rare words, or domain jargon) that are likely to occur given the context of the audio signal. Such tokens are otherwise prone to misrecognition, and transcribing them correctly can have an outsized impact on the value of the output.
To add items to this page, open a pull request following our contributing guide.
## End-to-end approaches: integrated neural modules
- Deep context: end-to-end contextual speech recognition
- Contextual Speech Recognition with Difficult Negative Training Examples
- Phoebe: Pronunciation-aware Contextualization for End-to-end Speech Recognition
- Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models
- Joint Grapheme and Phoneme Embeddings for Contextual End-to-End ASR
- Contextual RNN-T For Open Domain ASR
- Multistate Encoding with End-To-End Speech RNN Transducer Network
- Deep Shallow Fusion for RNN-T Personalization
- Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion
- Context-Aware Transformer Transducer for Speech Recognition
- Contextual Adapters for Personalized Speech Recognition in Neural Transducers
- Two Stage Contextual Word Filtering for Context Bias in Unified Streaming and Non-streaming Transducer
- Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition
- Tree-constrained Pointer Generator for End-to-end Contextual Speech Recognition
- Fast Contextual Adaptation with Neural Associative Memory for On-Device Personalized Speech Recognition
- Improving End-to-End Contextual Speech Recognition with Fine-grained Contextual Knowledge Selection
- End-to-End Contextual ASR Based on Posterior Distribution Adaptation for Hybrid CTC/Attention System
- Towards Contextual Spelling Correction for Customization of End-to-end Speech Recognition Systems
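Many of the integrated approaches above (the Deep Context / CLAS line of work and its successors) encode a list of bias phrases and let the decoder attend over the phrase embeddings at each step. A minimal NumPy sketch of that attention step; the shapes, names, and the "no-bias" row are illustrative assumptions, not any specific paper's architecture:

```python
import numpy as np

def context_attention(decoder_state, phrase_embeddings):
    """Attend over bias-phrase embeddings, using the decoder state as query.

    decoder_state:      (d,)   current decoder hidden state
    phrase_embeddings:  (n, d) one embedding per bias phrase; real systems
                               include a learned "no-bias" row so the model
                               can ignore the context list entirely
    Returns a (d,) context vector, typically concatenated with the decoder
    state before predicting the next token.
    """
    scores = phrase_embeddings @ decoder_state      # (n,) dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over phrases
    return weights @ phrase_embeddings              # (d,) weighted average

# toy example: 3 bias-phrase embeddings (last row standing in for
# the "no-bias" option), embedding dimension d = 4
rng = np.random.default_rng(0)
phrases = rng.normal(size=(3, 4))
state = rng.normal(size=4)
ctx = context_attention(state, phrases)
print(ctx.shape)  # (4,)
```

In the full models, the phrase embeddings come from a separate context encoder (e.g. an LSTM over the grapheme or phoneme sequence of each phrase), and the attention is trained jointly with the recognizer.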
## External modules such as language models, error-correction models, and weighted FSTs applied to hypotheses of E2E ASR systems
- Composition-based on-the-fly rescoring for salient n-gram biasing
- Improved recognition of contact names in voice commands
- Contextual speech recognition in end-to-end neural network systems using beam search
- Recurrent Neural Network Language Model Adaptation for Conversational Speech Recognition
- End-to-end contextual speech recognition using class language models and a token passing decoder
- Contextual Recovery of Out-of-Lattice Named Entities in Automatic Speech Recognition
- Shallow-Fusion End-to-End Contextual Biasing
- Personalization of End-to-End Speech Recognition on Mobile Devices for Named Entities
- Joint Contextual Modeling for ASR Correction and Language Understanding
- Bangla Voice Command Recognition in end-to-end System Using Topic Modeling based Contextual Rescoring
- Fast and Robust Unsupervised Contextual Biasing for Speech Recognition
- Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model
- Incorporating Written Domain Numeric Grammars into End-To-End Contextual Speech Recognition Systems for Improved Recognition of Numeric Sequences
- Class LM and word mapping for contextual biasing in End-to-End ASR
- Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator
- Hierarchical Multi-Stage Word-to-Grapheme Named Entity Corrector for Automatic Speech Recognition
- Improving accuracy of rare words for RNN-Transducer through unigram shallow fusion
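Most of the external-module approaches above share the same shallow-fusion scoring rule: during beam search or rescoring, the E2E model's hypothesis score is interpolated with a score from a biasing LM or FST. A deliberately simplified sketch; the bias weight and phrase list are illustrative, and a real system would score per subword with an FST or class LM, using failure arcs to retract partial matches:

```python
def biased_score(asr_logprob, hypothesis, bias_phrases, weight=2.0):
    """Shallow fusion, toy version: add a fixed log-score bonus for each
    bias phrase that appears verbatim in the hypothesis text."""
    bonus = sum(weight for phrase in bias_phrases if phrase in hypothesis)
    return asr_logprob + bonus

# toy rescoring of two competing hypotheses for the same audio
bias = ["Khalid"]  # e.g. a contact name pulled from the user's phone
hyps = [(-4.1, "call collide"), (-4.3, "call Khalid")]
best = max(hyps, key=lambda h: biased_score(h[0], h[1], bias))
print(best[1])  # "call Khalid"
```

Even this crude bonus flips the ranking: the acoustically stronger but contextually wrong "call collide" loses to the biased hypothesis once the contact name is rewarded.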