awesome-speech-recognition-speech-synthesis-papers

automatic speech recognition/speech synthesis paper roadmap, including HMM, DNN, RNN, CNN, Seq2Seq, Attention

Introduction

Automatic Speech Recognition has been investigated for several decades, and speech recognition models are from HMM-GMM to deep neural networks today. It's very necessary to see the history of speech recognition by this awesome paper roadmap. I will cover papers from traditional models to nowadays popular models, not only acoustic models or ASR systems, but also many interesting language models.

Automatic Speech Recognition

An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition(1982), S. E. LEVINSON et al. [pdf]
A Maximum Likelihood Approach to Continuous Speech Recognition(1983), LALIT R. BAHL et al. [pdf]
Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition(1986), Andrew K. Halberstadt. [pdf]
Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition(1986), Lalit R. Bahi et al. [pdf]
A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition(1989), Lawrence R Rabiner. [pdf]
Phoneme recognition using time-delay neural networks(1989), Alexander H. Waibel et al. [pdf]
Speaker-independent phone recognition using hidden Markov models(1989), Kai-Fu Lee et al. [pdf]
Hidden Markov Models for Speech Recognition(1991), B. H. Juang et al. [pdf]
Review of Tdnn (time Delay Neural Network) Architectures for Speech Recognition(2014), Masahide Sugiyamat et al. [pdf]
Connectionist Speech Recognition: A Hybrid Approach(1994), Herve Bourlard et al. [pdf]
A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)(1997), J.G. Fiscus. [pdf]
Speech recognition with weighted finite-state transducers(2001), M Mohri et al. [pdf]
Framewise phoneme classification with bidirectional LSTM and other neural network architectures(2005), Alex Graves et al. [pdf]
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks(2006), Alex Graves et al. [pdf]
The kaldi speech recognition toolkit(2011), Daniel Povey et al. [pdf]
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition(2012), Ossama Abdel-Hamid et al. [pdf]
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition(2012), George E. Dahl et al. [pdf]
Deep Neural Networks for Acoustic Modeling in Speech Recognition(2012), Geoffrey Hinton et al. [pdf]
Sequence Transduction with Recurrent Neural Networks(2012), Alex Graves et al. [pdf]
Deep convolutional neural networks for LVCSR(2013), Tara N. Sainath et al. [pdf]
Improving deep neural networks for LVCSR using rectified linear units and dropout(2013), George E. Dahl et al. [pdf]
Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training(2013), Yajie Miao et al. [pdf]
Improvements to deep convolutional neural networks for LVCSR(2013), Tara N. Sainath et al. [pdf]
Machine Learning Paradigms for Speech Recognition: An Overview(2013), Li Deng et al. [pdf]
Recent advances in deep learning for speech research at Microsoft(2013), Li Deng et al. [pdf]
Speech recognition with deep recurrent neural networks(2013), Alex Graves et al. [pdf]
Convolutional deep maxout networks for phone recognition(2014), László Tóth et al. [pdf]
Convolutional Neural Networks for Speech Recognition(2014), Ossama Abdel-Hamid et al. [pdf]
Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition(2014), László Tóth. [pdf]
Deep Speech: Scaling up end-to-end speech recognition(2014), Awni Y. Hannun et al. [pdf]
End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results(2014), Jan Chorowski et al. [pdf]
First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs(2014), Andrew L. Maas et al. [pdf]
Long short-term memory recurrent neural network architectures for large scale acoustic modeling(2014), Hasim Sak et al. [pdf]
Robust CNN-based speech recognition with Gabor filter kernels(2014), Shuo-Yiin Chang et al. [pdf]
Stochastic pooling maxout networks for low-resource speech recognition(2014), Meng Cai et al. [pdf]
Towards End-to-End Speech Recognition with Recurrent Neural Networks(2014), Alex Graves et al. [pdf]
A neural transducer(2015), N Jaitly et al. [pdf]
Attention-Based Models for Speech Recognition(2015), Jan Chorowski et al. [pdf]
Analysis of CNN-based speech recognition system using raw speech as input(2015), Dimitri Palaz et al. [pdf]
Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks(2015), Tara N. Sainath et al. [pdf]
Deep convolutional neural networks for acoustic modeling in low resource languages(2015), William Chan et al. [pdf]
Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition(2015), Chao Weng et al. [pdf]
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding(2015), Y Miao et al. [pdf]
Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition(2015), Hasim Sak et al. [pdf]
Lexicon-Free Conversational Speech Recognition with Neural Networks(2015), Andrew L. Maas et al. [pdf]
Online Sequence Training of Recurrent Neural Networks with Connectionist Temporal Classification(2015), Kyuyeon Hwang et al. [pdf]
Advances in All-Neural Speech Recognition(2016), Geoffrey Zweig et al. [pdf]
Advances in Very Deep Convolutional Neural Networks for LVCSR(2016), Tom Sercu et al. [pdf]
End-to-end attention-based large vocabulary speech recognition(2016), Dzmitry Bahdanau et al. [pdf]
Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention(2016), Dong Yu et al. [pdf]
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin(2016), Dario Amodei et al. [pdf]
End-to-end attention-based distant speech recognition with Highway LSTM(2016), Hassan Taherian. [pdf]
Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning(2016), Suyoun Kim et al. [pdf]
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition(2016), William Chan et al. [pdf]
Latent Sequence Decompositions(2016), William Chan et al. [pdf]
Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks(2016), Tara N. Sainath et al. [pdf]
Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition(2016), Suyoun Kim et al. [pdf]
Segmental Recurrent Neural Networks for End-to-End Speech Recognition(2016), Liang Lu et al. [pdf]
Towards better decoding and language model integration in sequence to sequence models(2016), Jan Chorowski et al. [pdf]
Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition(2016), Yanmin Qian et al. [pdf]
Very Deep Convolutional Networks for End-to-End Speech Recognition(2016), Yu Zhang et al. [pdf]
Very deep multilingual convolutional neural networks for LVCSR(2016), Tom Sercu et al. [pdf]
Wav2Letter: an End-to-End ConvNet-based Speech Recognition System(2016), Ronan Collobert et al. [pdf]
WaveNet: A Generative Model for Raw Audio(2016), Aäron van den Oord et al. [pdf]
Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech(2017), Michael Neumann et al. [pdf]
An enhanced automatic speech recognition system for Arabic(2017), Mohamed Amine Menacer et al. [pdf]
Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM(2017), Takaaki Hori et al. [pdf]
A network of deep neural networks for distant speech recognition(2017), Mirco Ravanelli et al. [pdf]
An online sequence-to-sequence model for noisy speech recognition(2017), Chung-Cheng Chiu et al. [pdf]
An Unsupervised Speaker Clustering Technique based on SOM and I-vectors for Speech Recognition Systems(2017), Hany Ahmed et al. [pdf]
Attention-Based End-to-End Speech Recognition in Mandarin(2017), C Shan et al. [pdf]
Building DNN acoustic models for large vocabulary speech recognition(2017), Andrew L. Maas et al. [pdf]
Direct Acoustics-to-Word Models for English Conversational Speech Recognition(2017), Kartik Audhkhasi et al. [pdf]
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments(2017), Zixing Zhang et al. [pdf]
English Conversational Telephone Speech Recognition by Humans and Machines(2017), George Saon et al. [pdf]
ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA(2017), Song Han et al. [pdf]
Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition(2017), Chris Donahue et al. [pdf]
Deep LSTM for Large Vocabulary Continuous Speech Recognition(2017), Xu Tian et al. [pdf]
Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition(2017), Taesup Kim et al. [pdf]
Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling(2017), Hairong Liu et al. [pdf]
Improving the Performance of Online Neural Transducer Models(2017), Tara N. Sainath et al. [pdf]
Learning Filterbanks from Raw Speech for Phone Recognition(2017), Neil Zeghidour et al. [pdf]
Multichannel End-to-end Speech Recognition(2017), Tsubasa Ochiai et al. [pdf]
Multi-task Learning with CTC and Segmental CRF for Speech Recognition(2017), Liang Lu et al. [pdf]
Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition(2017), Tara N. Sainath et al. [pdf]
Multilingual Speech Recognition With A Single End-To-End Model(2017), Shubham Toshniwal et al. [pdf]
Optimizing expected word error rate via sampling for speech recognition(2017), Matt Shannon. [pdf]
Residual Convolutional CTC Networks for Automatic Speech Recognition(2017), Yisen Wang et al. [pdf]
Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition(2017), Jaeyoung Kim et al. [pdf]
Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition(2017), Suyoun Kim et al. [pdf]
Reducing Bias in Production Speech Models(2017), Eric Battenberg et al. [pdf]
Robust Speech Recognition Using Generative Adversarial Networks(2017), Anuroop Sriram et al. [pdf]
State-of-the-art Speech Recognition With Sequence-to-Sequence Models(2017), Chung-Cheng Chiu et al. [pdf]
Towards Language-Universal End-to-End Speech Recognition(2017), Suyoun Kim et al. [pdf]
Accelerating recurrent neural network language model based online speech recognition system(2018), K Lee et al. [pdf]
An improved hybrid CTC-Attention model for speech recognition(2018), Zhe Yuan et al. [pdf]
Hybrid CTC-Attention based End-to-End Speech Recognition using Subword Units(2018), Zhangyu Xiao et al. [pdf]
Improved Noisy Student Training for Automatic Speech Recognition (2020), Daniel S. Park, et al. [pdf]
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context (2020), Wei Han, et al. [pdf]

Speaker Verification

Speaker Verification Using Adapted Gaussian Mixture Models(2000), Douglas A.Reynolds et al. [pdf]
A tutorial on text-independent speaker verification(2004), Frédéric Bimbot et al. [pdf]
Deep neural networks for small footprint text-dependent speaker verification(2014), E Variani et al. [pdf]
Deep Speaker Vectors for Semi Text-independent Speaker Verification(2015), Lantian Li et al. [pdf]
Deep Speaker: an End-to-End Neural Speaker Embedding System(2017), Chao Li et al. [pdf]
Deep Speaker Feature Learning for Text-independent Speaker Verification(2017), Lantian Li et al. [pdf]
Deep Speaker Verification: Do We Need End to End?(2017), Dong Wang et al. [pdf]
Speaker Diarization with LSTM(2017), Quan Wang et al. [pdf]
Text-Independent Speaker Verification Using 3D Convolutional Neural Networks(2017), Amirsina Torfi et al. [pdf]
End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances(2017), Chunlei Zhang et al. [pdf]
Deep Neural Network Embeddings for Text-Independent Speaker Verification(2017), David Snyder et al. [pdf]
Deep Discriminative Embeddings for Duration Robust Speaker Verification(2018), Na Li et al. [pdf]
Learning Discriminative Features for Speaker Identification and Verification(2018), Sarthak Yadav et al. [pdf]
Large Margin Softmax Loss for Speaker Verification(2019), Yi Liu et al. [pdf]
Unsupervised feature enhancement for speaker verification(2019), Phani Sankar Nidadavolu et al. [pdf]
Feature enhancement with deep feature losses for speaker verification(2019), Saurabh Kataria et al. [pdf]
Generalized end2end loss for speaker verification(2019), Li Wan et al. [pdf]
Spatial Pyramid Encoding with Convex Length Normalization for Text-Independent Speaker Verification(2019), Youngmoon Jung et al. [pdf]
VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge(2019), Son Chung et al. [pdf]
BUT System Description to VoxCeleb Speaker Recognition Challenge 2019(2019), Hossein Zeinali et al. [pdf]

Voice Conversion

Voice conversion using deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks(2015), Lifa Sun et al. [pdf]
Phonetic posteriorgrams for many-to-one voice conversion without parallel data training(2016), Lifa Sun et al. [pdf]
AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss(2019), Kaizhi Qian et al. [pdf]
Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion(2019), Andy T. Liu et al. [pdf]
F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder(2020), Kaizhi Qian et al. [pdf]

Speech Synthesis

Signal estimation from modified short-time Fourier transform(1993), Daniel W. Griffin et al. [pdf]
Text-to-speech synthesis(2009), Paul Taylor et al. [pdf]
A fast Griffin-Lim algorithm(2013), Nathanael Perraudin et al. [pdf]
TTS synthesis with bidirectional LSTM based recurrent neural networks(2014), Yuchen Fan et al. [pdf]
First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention(2016), Wenfu Wang et al. [pdf]
Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer(2016), Xavi Gonzalvo et al. [pdf]
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model(2016), Soroush Mehri et al. [pdf]
WaveNet: A Generative Model for Raw Audio(2016), Aäron van den Oord et al. [pdf]
Char2Wav: End-to-end speech synthesis(2017), J Sotelo et al. [pdf]
Deep Voice: Real-time Neural Text-to-Speech(2017), Sercan O. Arik et al. [pdf]
Deep Voice 2: Multi-Speaker Neural Text-to-Speech(2017), Sercan Arik et al. [pdf]
Deep Voice 3: 2000-Speaker Neural Text-to-speech(2017), Wei Ping et al. [pdf]
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions(2017), Jonathan Shen et al. [pdf]
Parallel WaveNet: Fast High-Fidelity Speech Synthesis(2017), Aaron van den Oord et al. [pdf]
Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under A Multi-task Learning Framework(2017), S Yang et al. [pdf]
Tacotron: Towards End-to-End Speech Synthesis(2017), Yuxuan Wang et al. [pdf]
Uncovering Latent Style Factors for Expressive Speech Synthesis(2017), Yuxuan Wang et al. [pdf]
VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop(2017), Yaniv Taigman et al. [pdf]
ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech(2018), Wei Ping et al. [pdf]
Deep Feed-forward Sequential Memory Networks for Speech Synthesis(2018), Mengxiao Bi et al. [pdf]
LPCNet: Improving Neural Speech Synthesis Through Linear Prediction(2018), Jean-Marc Valin et al. [pdf]
Learning latent representations for style control and transfer in end-to-end speech synthesis(2018), Ya-Jie Zhang et al. [pdf]
Neural Voice Cloning with a Few Samples(2018), Sercan O. Arık et al. [pdf]
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis(2018), Y Wang et al. [pdf]
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron(2018), RJ Skerry-Ryan et al. [pdf]
DurIAN: Duration Informed Attention Network For Multimodal Synthesis(2019), Chengzhu Yu et al. [pdf]
Fast spectrogram inversion using multi-head convolutional neural networks(2019), SÖ Arık et al. [pdf]
FastSpeech: Fast, Robust and Controllable Text to Speech(2019), Yi Ren et al. [pdf]
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning(2019), Yu Zhang et al. [pdf]
MelNet: A Generative Model for Audio in the Frequency Domain(2019), Sean Vasquez et al. [pdf]
Multi-Speaker End-to-End Speech Synthesis(2019), Jihyun Park et al. [pdf]
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis(2019), Kundan Kumar et al. [pdf]
Neural Speech Synthesis with Transformer Network(2019), Naihan Li et al. [pdf]
Parallel Neural Text-to-Speech(2019), Kainan Peng et al. [pdf]
Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN(2019), David Alvarez et al. [pdf]
Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS(2019), Mutian He et al. [pdf]
Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models(2019), Wei Fang et al. [pdf]
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis(2019), Ye Jia et al. [pdf]
WaveFlow: A Compact Flow-based Model for Raw Audio(2019), Wei Ping et al. [pdf]
Waveglow: A flow-based generative network for speech synthesis(2019), R Prenger et al. [pdf]
AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignmen(2020), Zhen Zeng et al. [pdf]
BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization(2020), Henry B.Moss et al. [pdf]
Bunched LPCNet : Vocoder for Low-cost Neural Text-To-Speech Systems(2020), Ravichander Vipperla et al. [pdf]
CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech(2020), Sri Karlapati et al. [pdf]
End-to-End Adversarial Text-to-Speech(2020), Jeff Donahue et al. [pdf]
Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis(2020), Rafael Valle et al. [pdf]
Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow(2020), Chenfeng Miao et al. [pdf]
Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis(2020), Guangzhi Sun et al. [pdf]
Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior(2020), Guangzhi Sun et al. [pdf]
Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search(2020), Jaehyeon Kim et al. [pdf]
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis(2020), Jungil Kong et al. [pdf]
Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesi(2020), Eric Battenberg et al. [pdf]
WaveGrad: Estimating Gradients for Waveform Generation(2020), Nanxin Chen et al. [pdf]

Language Modelling

Class-Based n-gram Models of Natural Language(1992), Peter F. Brown et al. [pdf]
An empirical study of smoothing techniques for language modeling(1996), Stanley F. Chen et al. [pdf]
A Neural Probabilistic Language Model(2000), Yoshua Bengio et al. [pdf]
A new statistical approach to Chinese Pinyin input(2000), Zheng Chen et al. [pdf]
Discriminative n-gram language modeling(2007), Brian Roark et al. [pdf]
Neural Network Language Model for Chinese Pinyin Input Method Engine(2015), S Chen et al. [pdf]
Efficient Training and Evaluation of Recurrent Neural Network Language Models for Automatic Speech Recognition(2016), Xie Chen et al. [pdf]
Exploring the limits of language modeling(2016), R Jozefowicz et al. [pdf]
On the State of the Art of Evaluation in Neural Language Models(2016), G Melis et al. [pdf]

Confidence Estimates

Estimating Confidence using Word Lattices(1997), T. Kemp et al. [pdf]
Large vocabulary decoding and confidence estimation using word posterior probabilities(2000), G. Evermann et al. [pdf]
Combining Information Sources for Confidence Estimation with CRF Models(2011), M. S. Seigel et al. [pdf]
Speaker-Adapted Confidence Measures for ASR using Deep Bidirectional Recurrent Neural Networks(2018), M. ́A. Del-Agua et al. [pdf]
Bi-Directional Lattice Recurrent Neural Networks for Confidence Estimation(2018), Q. Li et al. [pdf]
Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks(2020), A. Kastanos et al. [pdf]

Music Modelling

Onsets and Frames: Dual-Objective Piano Transcription(2017), Curtis Hawthorne et al. [pdf]
ByteSing- A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders(2020), Yu Gu et al. [pdf]
Jukebox: A Generative Model for Music(2020), Prafulla Dhariwal et al. [pdf]

Interesting papers

FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow(2019), Xuezhe Ma et al. [pdf]
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks(2019), Santiago Pascual et al. [pdf]
Self-supervised audio representation learning for mobile devices(2019), Marco Tagliasacchi et al. [pdf]
SinGAN: Learning a Generative Model from a Single Natural Image(2019), Tamar Rott Shaham et al. [pdf]
Audio2Face: Generating Speech/Face Animation from Single Audio with Attention-Based Bidirectional LSTM Networks(2019), Guanzhong Tian et al. [pdf]

Contact Me

For any questions, welcome to send email to :zzw922cn@gmail.com. Thanks!

tathaghosh/awesome-speech-recognition-speech-synthesis-papers