Deep-learning-based Text-to-speech (TTS) Papers and Resources

A collection of text-to-speech (TTS) papers and resources based on deep learning


Data

[Melspectrogram]

  • Speech Technology: A Practical Introduction, Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis (K. Prahallad., CMU, slide, video)
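As a quick illustration of the mel-frequency analysis covered in the material above, here is a minimal sketch of the Hz-to-mel mapping that underlies a mel-spectrogram. It assumes the common HTK-style formula `mel = 2595 * log10(1 + f / 700)`; other conventions exist, and the helper names and the 80-band, 0 to 8000 Hz setup in the usage note are illustrative assumptions, not taken from the linked slides.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> list:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale.

    The bands are equally spaced in mels, so they become wider in Hz
    as the frequency increases.
    """
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]
```

For example, `mel_band_centers(80, 0.0, 8000.0)` returns 80 center frequencies that are densely packed at low frequencies and sparse at high frequencies. A full mel-spectrogram additionally needs an STFT and triangular filters around these centers, which audio libraries normally provide.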

Mel-spectrogram Generator

[Autoregressive]

RNN

  • Char2Wav (J. Sotelo et al., Feb. 2017, Montreal, paper)
  • Tacotron (Y. Wang et al., Mar. 2017, Google, arxiv)
  • Tacotron 2 (J. Shen et al., Dec. 2017, Google, arxiv)

CNN

  • Deep Voice 3 (W. Ping et al., Oct. 2017, Baidu, arxiv)
  • Deep Convolutional Text-to-speech (H. Tachibana et al., Oct. 2017, arxiv)

Transformer

  • Transformer TTS (N. Li et al., Sep. 2018, Microsoft, arxiv)

[Non-autoregressive]

CNN

  • ParaNet (K. Peng et al., May 2019, Baidu, arxiv)

Transformer

  • FastSpeech (Y. Ren et al., May 2019, Microsoft, arxiv)
  • AlignTTS (Z. Zeng et al., Mar. 2020, Ping An Tech., arxiv)
  • FastSpeech 2 (Y. Ren et al., Jun. 2020, Microsoft, arxiv)

[Graph Neural Networks]

  • GraphTTS (A. Sun et al., Mar. 2020, Ping An Tech., arxiv)

[Attention Improvement]

  • Monotonic Attention (C. Raffel et al., Jun. 2017, Google Brain, arxiv)
  • Monotonic Chunkwise Attention (C.C. Chiu et al., Dec. 2017, Google Brain, arxiv)
  • Stepwise Monotonic Attention (M. He et al., Jun. 2019, Microsoft, arxiv)
  • Location-relative Attention Mechanisms for Robust Long-form Speech Synthesis (E. Battenberg et al., Oct. 2019, Google, arxiv)

[Training Algorithm]

  • A New GAN-based Training Algorithm (H. Guo et al., Apr. 2019, Microsoft, arxiv)

[Data-Efficient]

  • Semi-supervised Training for Improving Data Efficiency in End-to-end Speech Synthesis (Y.A. Chung et al., Aug. 2018, MIT & Google, arxiv)
  • Sample Efficient Adaptive Text-to-Speech (Y. Chen et al., Jan. 2019, DeepMind & Google, arxiv)

Neural Vocoder

[Autoregressive Model]

  • WaveNet (A. van den Oord et al., Sep. 2016, DeepMind, arxiv)
  • SampleRNN (S. Mehri et al., Dec. 2016, Montreal, arxiv)

[Inverse Autoregressive Flow Model]

  • Parallel WaveNet (A. van den Oord et al., Nov. 2017, DeepMind, arxiv)
  • ClariNet (W. Ping et al., Jul. 2018, Baidu, arxiv)
  • WaveGlow (R. Prenger et al., Nov. 2018, NVIDIA, arxiv)
  • FloWaveNet (S. Kim et al., Nov. 2018, SNU, arxiv)

[Generative Adversarial Network]

  • WaveGAN (C. Donahue et al., Feb. 2018, UCSD, arxiv)
  • GAN-TTS (M. Binkowski et al., Sep. 2019, Google, arxiv)
  • Parallel WaveGAN (R. Yamamoto et al., Oct. 2019, Naver, arxiv)

Style Modeling

[Style Token]

  • Uncovering Latent Style Factors for Expressive Speech Synthesis (Y. Wang et al., Nov. 2017, Google, arxiv)
  • GST Tacotron (Y. Wang et al., Mar. 2018, Google, arxiv)
  • TP-GST Tacotron (D. Stanton et al., Aug. 2018, Google, arxiv)

[Generative Adversarial Network]

  • TTS-GAN (S. Ma et al., Apr. 2019, Microsoft, paper)
  • Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning (Y. Zhang et al., Jul. 2019, Google, arxiv)
  • Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization (W. N. Hsu et al., Sep. 2019, Google, paper)

[Mutual Information]

  • Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis (T.Y. Hu et al., Mar. 2020, CMU & Apple, arxiv)

Dataset

[English]

ASR

  • LibriSpeech dataset (paper, download) (V. Panayotov et al., 2015)
    • 2,484 speakers in total, 1,000+ hours
  • VoxCeleb1 dataset (paper, download) (A. Nagrani et al., 2017)
    • 153,516 utterances, 1,251 speakers, 352 hours
  • VoxCeleb2 dataset (paper, download) (J. S. Chung et al., 2018)
    • 1,128,246 utterances, 6,112 speakers, 2,442 hours

TTS

  • LJSpeech dataset (download) (Keith Ito and Linda Johnson, 2017)
    • single female speaker, 13,100 samples, approximately 24 hours
  • CSTR VCTK Corpus (download) (C. Veaux et al.)
    • 109 English speakers with various accents, about 400 utterances per speaker
  • Blizzard dataset (download)

[Korean]

ASR

  • Speech Data for Training Korean and English Acoustic Models (한국어 및 영어 음향모델 훈련용 음성 데이터) (download) (ETRI)
    • (Korean speech) 50 speakers * 100 utterances/speaker (5,000 utterances in total)
    • (English speech spoken by Korean speakers) 50 speakers * 100 utterances/speaker (5,000 utterances in total)
  • Children's Speech Data for Voice Interface Development (음성인터페이스 개발을 위한 어린이 음성 데이터) (download) (ETRI)
    • 50 speakers * 100 utterances/speaker * 3 environments (total 16,200 speech samples)
    • speakers: elementary school students (1st to 4th grade)
    • recorded with an iPhone 5, a Samsung Galaxy S4, and microphones
  • ClovaCall dataset (paper, download) (Naver Corp.)
    • 140,000+ utterances, 211+ hours of noisy and clean speech
  • KSponSpeech (download) (ETRI)
    • 2,000 speakers, 1,000+ hours, various topics (daily life, shopping, hobbies, weather, etc.)
  • Korean Read Speech Corpus (download) (National Institute of Korean Language, 국립국어원)
    • 8 speakers, 120+ hours
  • Zeroth-Korean (download) (Lucas Jo and Wonkyum Lee)
    • 115 speakers, 52.8 hours, 22720 utterances in total
  • Pansori TED x KR Corpus (paper, download) (Y. Choi and B. Lee)
    • 3 hours, 41 speakers

TTS

  • Korean Single Speaker (KSS) Speech Dataset (download) (K. Park)
    • single female speaker, 12853 samples, 12+ hours
  • Emotional Speech Synthesis Dataset (감정 음성합성 데이터셋) (download) (Acryl Inc., (주) 아크릴)
    • single female speaker, 7 emotions (neutral, sadness, fear, happiness, anger, disgust, surprise), 22,000 samples in total (about 3,000 samples per emotion)
  • EmotionTTS-Open-DB dataset (download) (KAIST and Selvas AI Inc., (주) 셀바스AI)
    • single-speaker, multi-speaker, and multi-speaker-multi-emotion dataset
  • KAIST Audiobook Dataset (카이스트 오디오북 데이터셋) (download) (KAIST)
    • 58,559 utterances, 72+ hours, 13 speakers
    • various reading materials (news, novels, etc.)
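
When comparing the corpora above, a rough average utterance length can be derived from the listed totals. The sketch below is only arithmetic on the figures quoted in this list (LJSpeech: 13,100 samples in about 24 hours; Zeroth-Korean: 22,720 utterances in 52.8 hours); the function name is a hypothetical helper, and the results are estimates because the listed hours are approximate.

```python
def average_utterance_seconds(total_hours: float, n_utterances: int) -> float:
    """Rough mean utterance length in seconds from corpus-level totals."""
    if n_utterances <= 0:
        raise ValueError("n_utterances must be positive")
    return total_hours * 3600.0 / n_utterances


# Figures as quoted in the dataset list above (approximate).
corpora = {
    "LJSpeech": (24.0, 13100),
    "Zeroth-Korean": (52.8, 22720),
}

for name, (hours, count) in corpora.items():
    print(f"{name}: {average_utterance_seconds(hours, count):.1f} s per utterance")
```

Entries given as lower bounds (for example "12+ hours") only yield a lower bound on the average, so they are left out of the sketch.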