awesome-speech-recognition-speech-synthesis-papers

automatic speech recognition/speech synthesis paper roadmap, including HMM, DNN, RNN, CNN, Seq2Seq, Attention

Introduction

Automatic Speech Recognition has been investigated for several decades, and speech recognition models are from HMM-GMM to deep neural networks today. It's very necessary to see the history of speech recognition by this awesome paper roadmap. I will cover papers from traditional models to nowadays popular models, not only acoustic models or ASR systems, but also many interesting language models.

Paper List

Automatic Speech Recognition

An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition(1982), S. E. LEVINSON et al. [pdf]
A Maximum Likelihood Approach to Continuous Speech Recognition(1983), LALIT R. BAHL et al. [pdf]
Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition(1986), Andrew K. Halberstadt. [pdf]
Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition(1986), Lalit R. Bahi et al. [pdf]
A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition(1989), Lawrence R Rabiner. [pdf]
Phoneme recognition using time-delay neural networks(1989), Alexander H. Waibel et al. [pdf]
Speaker-independent phone recognition using hidden Markov models(1989), Kai-Fu Lee et al. [pdf]
Hidden Markov Models for Speech Recognition(1991), B. H. Juang et al. [pdf]
Connectionist Speech Recognition: A Hybrid Approach(1994), Herve Bourlard et al. [pdf]
A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)(1997), J.G. Fiscus. [pdf]
Speech recognition with weighted finite-state transducers(2001), M Mohri et al. [pdf]
Review of Tdnn (time Delay Neural Network) Architectures for Speech Recognition(2014), Masahide Sugiyamat et al. [pdf]
Framewise phoneme classification with bidirectional LSTM and other neural network architectures(2005), Alex Graves et al. [pdf]
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks(2006), Alex Graves et al. [pdf]
The kaldi speech recognition toolkit(2011), Daniel Povey et al. [pdf]
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition(2012), Ossama Abdel-Hamid et al. [pdf]
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition(2012), George E. Dahl et al. [pdf]
Deep Neural Networks for Acoustic Modeling in Speech Recognition(2012), Geoffrey Hinton et al. [pdf]
Sequence Transduction with Recurrent Neural Networks(2012), Alex Graves et al. [pdf]
Deep convolutional neural networks for LVCSR(2013), Tara N. Sainath et al. [pdf]
Improving deep neural networks for LVCSR using rectified linear units and dropout(2013), George E. Dahl et al. [pdf]
Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training(2013), Yajie Miao et al. [pdf]
Improvements to deep convolutional neural networks for LVCSR(2013), Tara N. Sainath et al. [pdf]
Machine Learning Paradigms for Speech Recognition: An Overview(2013), Li Deng et al. [pdf]
Recent advances in deep learning for speech research at Microsoft(2013), Li Deng et al. [pdf]
Speech recognition with deep recurrent neural networks(2013), Alex Graves et al. [pdf]
Convolutional deep maxout networks for phone recognition(2014), László Tóth et al. [pdf]
Convolutional Neural Networks for Speech Recognition(2014), Ossama Abdel-Hamid et al. [pdf]
Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition(2014), László Tóth. [pdf]
Deep Speech: Scaling up end-to-end speech recognition(2014), Awni Y. Hannun et al. [pdf]
End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results(2014), Jan Chorowski et al. [pdf]
First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs(2014), Andrew L. Maas et al. [pdf]
Long short-term memory recurrent neural network architectures for large scale acoustic modeling(2014), Hasim Sak et al. [pdf]
Robust CNN-based speech recognition with Gabor filter kernels(2014), Shuo-Yiin Chang et al. [pdf]
Stochastic pooling maxout networks for low-resource speech recognition(2014), Meng Cai et al. [pdf]
Towards End-to-End Speech Recognition with Recurrent Neural Networks(2014), Alex Graves et al. [pdf]
A neural transducer(2015), N Jaitly et al. [pdf]
Attention-Based Models for Speech Recognition(2015), Jan Chorowski et al. [pdf]
Analysis of CNN-based speech recognition system using raw speech as input(2015), Dimitri Palaz et al. [pdf]
Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks(2015), Tara N. Sainath et al. [pdf]
Deep convolutional neural networks for acoustic modeling in low resource languages(2015), William Chan et al. [pdf]
Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition(2015), Chao Weng et al. [pdf]
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding(2015), Y Miao et al. [pdf]
Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition(2015), Hasim Sak et al. [pdf]
Lexicon-Free Conversational Speech Recognition with Neural Networks(2015), Andrew L. Maas et al. [pdf]
Online Sequence Training of Recurrent Neural Networks with Connectionist Temporal Classification(2015), Kyuyeon Hwang et al. [pdf]
Advances in All-Neural Speech Recognition(2016), Geoffrey Zweig et al. [pdf]
Advances in Very Deep Convolutional Neural Networks for LVCSR(2016), Tom Sercu et al. [pdf]
End-to-end attention-based large vocabulary speech recognition(2016), Dzmitry Bahdanau et al. [pdf]
Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention(2016), Dong Yu et al. [pdf]
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin(2016), Dario Amodei et al. [pdf]
End-to-end attention-based distant speech recognition with Highway LSTM(2016), Hassan Taherian. [pdf]
Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning(2016), Suyoun Kim et al. [pdf]
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition(2016), William Chan et al. [pdf]
Latent Sequence Decompositions(2016), William Chan et al. [pdf]
Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks(2016), Tara N. Sainath et al. [pdf]
Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition(2016), Suyoun Kim et al. [pdf]
Segmental Recurrent Neural Networks for End-to-End Speech Recognition(2016), Liang Lu et al. [pdf]
Towards better decoding and language model integration in sequence to sequence models(2016), Jan Chorowski et al. [pdf]
Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition(2016), Yanmin Qian et al. [pdf]
Very Deep Convolutional Networks for End-to-End Speech Recognition(2016), Yu Zhang et al. [pdf]
Very deep multilingual convolutional neural networks for LVCSR(2016), Tom Sercu et al. [pdf]
Wav2Letter: an End-to-End ConvNet-based Speech Recognition System(2016), Ronan Collobert et al. [pdf]
WaveNet: A Generative Model for Raw Audio(2016), Aäron van den Oord et al. [pdf]
Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech(2017), Michael Neumann et al. [pdf]
An enhanced automatic speech recognition system for Arabic(2017), Mohamed Amine Menacer et al. [pdf]
Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM(2017), Takaaki Hori et al. [pdf]
A network of deep neural networks for distant speech recognition(2017), Mirco Ravanelli et al. [pdf]
An online sequence-to-sequence model for noisy speech recognition(2017), Chung-Cheng Chiu et al. [pdf]
An Unsupervised Speaker Clustering Technique based on SOM and I-vectors for Speech Recognition Systems(2017), Hany Ahmed et al. [pdf]
Attention-Based End-to-End Speech Recognition in Mandarin(2017), C Shan et al. [pdf]
Building DNN acoustic models for large vocabulary speech recognition(2017), Andrew L. Maas et al. [pdf]
Direct Acoustics-to-Word Models for English Conversational Speech Recognition(2017), Kartik Audhkhasi et al. [pdf]
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments(2017), Zixing Zhang et al. [pdf]
English Conversational Telephone Speech Recognition by Humans and Machines(2017), George Saon et al. [pdf]
ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA(2017), Song Han et al. [pdf]
Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition(2017), Chris Donahue et al. [pdf]
Deep LSTM for Large Vocabulary Continuous Speech Recognition(2017), Xu Tian et al. [pdf]
Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition(2017), Taesup Kim et al. [pdf]
Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling(2017), Hairong Liu et al. [pdf]
Improving the Performance of Online Neural Transducer Models(2017), Tara N. Sainath et al. [pdf]
Learning Filterbanks from Raw Speech for Phone Recognition(2017), Neil Zeghidour et al. [pdf]
Multichannel End-to-end Speech Recognition(2017), Tsubasa Ochiai et al. [pdf]
Multi-task Learning with CTC and Segmental CRF for Speech Recognition(2017), Liang Lu et al. [pdf]
Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition(2017), Tara N. Sainath et al. [pdf]
Multilingual Speech Recognition With A Single End-To-End Model(2017), Shubham Toshniwal et al. [pdf]
Optimizing expected word error rate via sampling for speech recognition(2017), Matt Shannon. [pdf]
Residual Convolutional CTC Networks for Automatic Speech Recognition(2017), Yisen Wang et al. [pdf]
Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition(2017), Jaeyoung Kim et al. [pdf]
Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition(2017), Suyoun Kim et al. [pdf]
Reducing Bias in Production Speech Models(2017), Eric Battenberg et al. [pdf]
Robust Speech Recognition Using Generative Adversarial Networks(2017), Anuroop Sriram et al. [pdf]
State-of-the-art Speech Recognition With Sequence-to-Sequence Models(2017), Chung-Cheng Chiu et al. [pdf]
Towards Language-Universal End-to-End Speech Recognition(2017), Suyoun Kim et al. [pdf]
Accelerating recurrent neural network language model based online speech recognition system(2018), K Lee et al. [pdf]
An improved hybrid CTC-Attention model for speech recognition(2018), Zhe Yuan et al. [pdf]
Hybrid CTC-Attention based End-to-End Speech Recognition using Subword Units(2018), Zhangyu Xiao et al. [pdf]

Speaker Verification

Speaker Verification Using Adapted Gaussian Mixture Models(2000), Douglas A.Reynolds et al. [pdf]
A tutorial on text-independent speaker verification(2004), Frédéric Bimbot et al. [pdf]
Deep neural networks for small footprint text-dependent speaker verification(2014), E Variani et al. [pdf]
Deep Speaker Vectors for Semi Text-independent Speaker Verification(2015), Lantian Li et al. [pdf]
Deep Speaker: an End-to-End Neural Speaker Embedding System(2017), Chao Li et al. [pdf]
Deep Speaker Feature Learning for Text-independent Speaker Verification(2017), Lantian Li et al. [pdf]
Deep Speaker Verification: Do We Need End to End?(2017), Dong Wang et al. [pdf]
Speaker Diarization with LSTM(2017), Quan Wang et al. [pdf]
Text-Independent Speaker Verification Using 3D Convolutional Neural Networks(2017), Amirsina Torfi et al. [pdf]

Speech Synthesis

Signal estimation from modified short-time Fourier transform(1993), Daniel W. Griffin et al. [pdf]
Text-to-speech synthesis(2009), Paul Taylor et al. [pdf]
A fast Griffin-Lim algorithm(2013), Nathanael Perraudin et al. [pdf]
TTS synthesis with bidirectional LSTM based recurrent neural networks(2014), Yuchen Fan et al. [pdf]
First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention(2016), Wenfu Wang et al. [pdf]
Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer(2016), Xavi Gonzalvo et al. [pdf]
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model(2016), Soroush Mehri et al. [pdf]
WaveNet: A Generative Model for Raw Audio(2016), Aäron van den Oord et al. [pdf]
Char2Wav: End-to-end speech synthesis(2017), J Sotelo et al. [pdf]
Deep Voice: Real-time Neural Text-to-Speech(2017), Sercan O. Arik et al. [pdf]
Deep Voice 2: Multi-Speaker Neural Text-to-Speech(2017), Sercan Arik et al. [pdf]
Deep Voice 3: 2000-Speaker Neural Text-to-speech(2017), Wei Ping et al. [pdf]
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions(2017), Jonathan Shen et al. [pdf]
Parallel WaveNet: Fast High-Fidelity Speech Synthesis(2017), Aaron van den Oord et al. [pdf]
Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under A Multi-task Learning Framework(2017), S Yang et al. [pdf]
Tacotron: Towards End-to-End Speech Synthesis(2017), Yuxuan Wang et al. [pdf]
Uncovering Latent Style Factors for Expressive Speech Synthesis(2017), Yuxuan Wang et al. [pdf]
VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop(2017), Yaniv Taigman et al. [pdf]
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions(2017), Jonathan Shen et al. [pdf]
ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech(2018), Wei Ping et al. [pdf]
Deep Feed-forward Sequential Memory Networks for Speech Synthesis(2018), Mengxiao Bi et al. [pdf]
Neural Voice Cloning with a Few Samples(2018), Sercan O. Arık et al. [pdf]
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis(2018), Y Wang et al. [pdf]
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron(2018), RJ Skerry-Ryan et al. [pdf]
Fast spectrogram inversion using multi-head convolutional neural networks(2019), SÖ Arık et al. [pdf]
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis(2019), Ye Jia et al. [pdf]

Language Modelling

Class-Based n-gram Models of Natural Language(1992), Peter F. Brown et al. [pdf]
An empirical study of smoothing techniques for language modeling(1996), Stanley F. Chen et al. [pdf]
A Neural Probabilistic Language Model(2000), Yoshua Bengio et al. [pdf]
A new statistical approach to Chinese Pinyin input(2000), Zheng Chen et al. [pdf]
Discriminative n-gram language modeling(2007), Brian Roark et al. [pdf]
Neural Network Language Model for Chinese Pinyin Input Method Engine(2015), S Chen et al. [pdf]
Efficient Training and Evaluation of Recurrent Neural Network Language Models for Automatic Speech Recognition(2016), Xie Chen et al. [pdf]
Exploring the limits of language modeling(2016), R Jozefowicz et al. [pdf]
On the State of the Art of Evaluation in Neural Language Models(2016), G Melis et al. [pdf]

Music Modelling

Onsets and Frames: Dual-Objective Piano Transcription(2017), Curtis Hawthorne et al. [pdf]

Contact Me

For any questions, welcome to send email to :zzw922cn@gmail.com. Thanks!

fxw915/awesome-speech-recognition-speech-synthesis-papers