wenet-e2e/speech-synthesis-paper

List of speech synthesis papers.

MIT

Speech Synthesis Paper

A list of speech synthesis papers (-> more papers <-). Recommendations for more awesome papers are welcome 😀.

Repositories that collect awesome speech papers:

awesome-speech-recognition-speech-synthesis-papers (from ponyzhang)
awesome-python-scientific-audio (from Fabian-Robert Stöter)
TTS-papers (from Eren Gölge)
awesome-speech-enhancement (from Vincent Liu)
speech-recognition-papers (from Xingchen Song)
awesome-tts-samples (from Seung-won Park)
awesome-speech-translation (from dqqcasia)
A Survey on Neural Speech Synthesis (from tts-tutorial)

What does '★' mean? I add '★' to papers with more than 50 citations (only in the Acoustic Model, Vocoder and TTS towards Stylization sections). Beginners can read these papers first to gain basic knowledge of deep-learning-based TTS models (#1).

Content

TTS Frontend
Acoustic Model
Vocoder
TTS towards Stylization
Voice Conversion
Singing
- Singing Voice Synthesis
- Singing Voice Conversion

TTS Frontend

Pre-trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis (Interspeech 2019)
A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis (ICASSP 2020)
A hybrid text normalization system using multi-head self-attention for mandarin (ICASSP 2020)
Unified Mandarin TTS Front-end Based on Distilled BERT Model (2021-01)

Acoustic Model

Autoregressive Model

Tacotron V1^★: Tacotron: Towards End-to-End Speech Synthesis (Interspeech 2017)
Tacotron V2^★: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (ICASSP 2018)
Deep Voice V1^★: Deep Voice: Real-time Neural Text-to-Speech (ICML 2017)
Deep Voice V2^★: Deep Voice 2: Multi-Speaker Neural Text-to-Speech (NeurIPS 2017)
Deep Voice V3^★: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (ICLR 2018)
Transformer-TTS^★: Neural Speech Synthesis with Transformer Network (AAAI 2019)
DurIAN: DurIAN: Duration Informed Attention Network For Multimodal Synthesis (2019)
Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis (ICASSP 2020)
Flowtron (flow based): Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020)
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (under review ICLR 2021)
RobuTrans (towards robust): RobuTrans: A Robust Transformer-Based Text-to-Speech Model (AAAI 2020)
DeviceTTS: DeviceTTS: A Small-Footprint, Fast, Stable Network for On-Device Text-to-Speech (2020-10)
Wave-Tacotron: Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis (2020-11)
Streaming Acoustic Modeling: Transformer-based Acoustic Modeling for Streaming Speech Synthesis (2021-06)
Apple TTS system: On-device neural speech synthesis (ASRU 2021)

Non-Autoregressive Model

ParaNet: Non-Autoregressive Neural Text-to-Speech (ICML 2020)
FastSpeech^★: FastSpeech: Fast, Robust and Controllable Text to Speech (NeurIPS 2019)
JDI-T: JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment (2020)
EATS: End-to-End Adversarial Text-to-Speech (2020)
FastSpeech 2: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (2020)
FastPitch: FastPitch: Parallel Text-to-speech with Pitch Prediction (2020)
Glow-TTS (flow based, Monotonic Attention): Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search (NeurIPS 2020)
Flow-TTS (flow based): Flow-TTS: A Non-Autoregressive Network for Text to Speech Based on Flow (ICASSP 2020)
SpeedySpeech: SpeedySpeech: Efficient Neural Speech Synthesis (Interspeech 2020)
Parallel Tacotron: Parallel Tacotron: Non-Autoregressive and Controllable TTS (2020)
BVAE-TTS: Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (ICLR 2021)
LightSpeech: LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search (ICASSP 2021)
Parallel Tacotron 2: Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling (2021)
Grad-TTS: Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech (ICML 2021)
VITS (flow based): Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (ICML 2021)
RAD-TTS: RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis (ICML 2021 Workshop)
WaveGrad 2: WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis (Interspeech 2021)
PortaSpeech: PortaSpeech: Portable and High-Quality Generative Text-to-Speech (NeurIPS 2021)
DelightfulTTS (To synthesize natural and high-quality speech from text): DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021 (Blizzard Challenge 2021)
DiffGAN-TTS: DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs (2022-01)
BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis (ICLR 2022)
JETS: JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech (Interspeech 2022)
WavThruVec: WavThruVec: Latent speech representation as intermediate features for neural speech synthesis (2022-03)
FastDiff: FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis (IJCAI 2022)
NaturalSpeech: NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality (2022-05)
DelightfulTTS 2: DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders (Interspeech 2022)
CLONE: Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech (2022-07)
ZET-Speech: ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models (Interspeech 2023)

Alignment Study

Monotonic Attention^★: Online and Linear-Time Attention by Enforcing Monotonic Alignments (ICML 2017)
Monotonic Chunkwise Attention^★: Monotonic Chunkwise Attention (ICLR 2018)
Forward Attention in Sequence-to-sequence Acoustic Modelling for Speech Synthesis (ICASSP 2018)
RNN-T for TTS: Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments (2019)
Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis (ICASSP 2020)
Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling (under review ICLR 2021)
EfficientTTS: EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture (2020-12)
VAENAR-TTS: VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis (2021-07)
One TTS Alignment To Rule Them All (2021-08)

Data Efficiency

Semi-Supervised Training for Improving Data Efficiency in End-to-End Speech Synthesis (2018)
Almost Unsupervised Text to Speech and Automatic Speech Recognition (ICML 2019)
Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages (Interspeech 2020)
Multilingual Speech Synthesis: One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech (Interspeech 2020)
Low-resource expressive text-to-speech using data augmentation (2020-11)
One TTS Alignment To Rule Them All (2021-08)
DenoiSpeech: DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling (ICASSP 2021)
Revisiting Over-Smoothness in Text to Speech (ACL 2022)
Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition (2022-03)
Simple and Effective Unsupervised Speech Synthesis (2022-04)
A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS (Interspeech 2022)
EPIC TTS Models (research on pruning): EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models (Interspeech 2022)

Vocoder

Autoregressive Model

WaveNet^★: WaveNet: A Generative Model for Raw Audio (2016)
WaveRNN^★: Efficient Neural Audio Synthesis (ICML 2018)
WaveGAN^★: Adversarial Audio Synthesis (ICLR 2019)
LPCNet^★: LPCNet: Improving Neural Speech Synthesis Through Linear Prediction (ICASSP 2019)
Towards achieving robust universal neural vocoding (Interspeech 2019)
GAN-TTS: High Fidelity Speech Synthesis with Adversarial Networks (2019)
MultiBand-WaveRNN: DurIAN: Duration Informed Attention Network For Multimodal Synthesis (2019)
Chunked Autoregressive GAN for Conditional Waveform Synthesis (2021-10)
Improved LPCNet: Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet (ICASSP 2022)
Bunched LPCNet2: Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge (2022-03)

Non-Autoregressive Model

Parallel-WaveNet^★: Parallel WaveNet: Fast High-Fidelity Speech Synthesis (2017)
WaveGlow^★: WaveGlow: A Flow-based Generative Network for Speech Synthesis (2018)
Parallel-WaveGAN^★: Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram (2019)
MelGAN^★: MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (NeurIPS 2019)
MultiBand-MelGAN: Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech (2020)
VocGAN: VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network (Interspeech 2020)
WaveGrad: WaveGrad: Estimating Gradients for Waveform Generation (2020)
DiffWave: DiffWave: A Versatile Diffusion Model for Audio Synthesis (2020)
HiFi-GAN: HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (NeurIPS 2020)
Parallel-WaveGAN (New): Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators (2020-10)
StyleMelGAN: StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization (ICASSP 2021)
Improved parallel WaveGAN vocoder with perceptually weighted spectrogram loss (SLT 2021)
Fre-GAN: Fre-GAN: Adversarial Frequency-consistent Audio Synthesis (Interspeech 2021)
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation (2021-07)
iSTFTNet: iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform (ICASSP 2022)
Parallel Synthesis for Autoregressive Speech Generation (2022-04)
Avocodo: Avocodo: Generative Adversarial Network for Artifact-free Vocoder (2022-06)

Others

(Robust vocoder): Towards Robust Neural Vocoding for Speech Generation: A Survey (2019)
(Source-filter model based): Neural source-filter waveform models for statistical parametric speech synthesis (TASLP 2019)
NHV: Neural Homomorphic Vocoder (Interspeech 2020)
Universal MelGAN: Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains (2020)
Binaural Speech Synthesis: Neural Synthesis of Binaural Speech From Mono Audio (ICLR 2021)
Checkerboard artifacts in neural vocoder: Upsampling artifacts in neural audio synthesis (ICASSP 2021)
Universal Vocoder Based on Parallel WaveNet: Universal Neural Vocoding with Parallel WaveNet (ICASSP 2021)
(Comparison of discriminator): GAN Vocoder: Multi-Resolution Discriminator Is All You Need (2021-03)
Vocoder Benchmark: VocBench: A Neural Vocoder Benchmark for Speech Synthesis (2021-12)
BigVGAN (Universal vocoder): BigVGAN: A Universal Neural Vocoder with Large-Scale Training (2022-06)

TTS towards Stylization

Expressive TTS

ReferenceEncoder-Tacotron^★: Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron (ICML 2018)
GST-Tacotron^★: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis (ICML 2018)
Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis (2018)
GMVAE-Tacotron2^★: Hierarchical Generative Modeling for Controllable Speech Synthesis (ICLR 2019)
BERT-TTS: Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models (2019)
(Multi-style Decouple): Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency (2019)
(Multi-style Decouple): Multi-reference Tacotron by Intercross Training for Style Disentangling, Transfer and Control in Speech Synthesis (Interspeech 2019)
Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
Robust and fine-grained prosody control of end-to-end speech synthesis (ICASSP 2019)
Flowtron (flow based): Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020)
(local style): Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis (ICASSP 2020)
Controllable Neural Prosody Synthesis (Interspeech 2020)
GraphSpeech: GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis (2020-10)
BERT-TTS: Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis (2020-11)
(Global Emotion Style Control): Controllable Emotion Transfer For End-to-End Speech Synthesis (2020-11)
(Phone Level Style Control): Fine-grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis (2020-11)
(Phone Level Prosody Modelling): Mixture Density Network for Phone-Level Prosody Modelling in Speech Synthesis (ICASSP 2021)
(Phone Level Prosody Modelling): Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis (ICASSP 2021)
PeriodNet: PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components (ICASSP 2021)
PnG BERT: PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS (Interspeech 2021)
Towards Multi-Scale Style Control for Expressive Speech Synthesis (2021-04)
Learning Robust Latent Representations for Controllable Speech Synthesis (2021-05)
Diverse and Controllable Speech Synthesis with GMM-Based Phone-Level Prosody Modelling (2021-05)
Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS (2021-06)
(Conversational Speech Synthesis): Controllable Context-aware Conversational Speech Synthesis (Interspeech 2021)
DeepRapper: DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling (ACL 2021)
Referee: Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis (2021)
(Text-Based Insertion TTS): Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration (Interspeech 2021)
On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis (2021-10)
Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models (2021-10)
TTS for dubbing: Neural Dubber: Dubbing for Videos According to Scripts (NeurIPS 2021)
Word-Level Style Control for Expressive, Non-attentive Speech Synthesis (SPECOM 2021)
MsEmoTTS: MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis (2022-01)
Disentangling Style and Speaker Attributes for TTS Style Transfer (2022-01)
Word-level prosody modeling: Unsupervised word-level prosody tagging for controllable speech synthesis (ICASSP 2022)
ProsoSpeech: ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech (ICASSP 2022)
CampNet (speech editing): CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing (2022-02)
vTTS (visual text): vTTS: visual-text to speech (2022-03)
CopyCat2: CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer (Interspeech 2022)
Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech (Interspeech 2022)
Expressive, Variable, and Controllable Duration Modelling in TTS (Interspeech 2022)

MultiSpeaker TTS

Meta-Learning for TTS^★: Sample Efficient Adaptive Text-to-Speech (ICLR 2019)
SV-Tacotron^★: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (NeurIPS 2018)
Deep Voice V3^★: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (ICLR 2018)
Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings (ICASSP 2020)
MultiSpeech: MultiSpeech: Multi-Speaker Text to Speech with Transformer (2020)
SC-WaveRNN: Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions (Interspeech 2020)
MultiSpeaker Dataset: AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines (2020)
Life-long learning for multi-speaker TTS: Continual Speaker Adaptation for Text-to-Speech Synthesis (2021-03)
Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation (ICML 2021)
Effective and Differentiated Use of Control Information for Multi-speaker Speech Synthesis (Interspeech 2021)
Speaker Generation (2021-11)
Meta-Voice: Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning (2021-11)

New Perspective on TTS

PromptTTS: PromptTTS: Controllable Text-to-Speech with Text Descriptions (2022-11)
VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2023-01)
InstructTTS: InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt (2023-01)
Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision (2023-02)
FoundationTTS: FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model (2023-03)

Voice Conversion

ASR & TTS Based

(introduce PPG into voice conversion): Phonetic posteriorgrams for many-to-one voice conversion without parallel data training (2016)
A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data (2019)
TTS-Skins: TTS Skins: Speaker Conversion via ASR (2019)
Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations (IEEE/ACM TASLP 2019)
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization (Interspeech 2019)
Cotatron (combine text information with voice conversion system): Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data (Interspeech 2020)
(TTS & ASR): Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer (Interspeech 2020)
FragmentVC (wav2vec): FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention (2020)
Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram (ICASSP 2021)
(TTS & ASR): On Prosody Modeling for ASR+TTS based Voice Conversion (2021-07)
Cloning one's voice using very limited data in the wild (2021-10)

VAE & Auto-Encoder Based

VAE-VC (VAE based): Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder (2016)
(Speech representation learning by VQ-VAE): Unsupervised speech representation learning using WaveNet autoencoders (2019)
Blow (Flow based): Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion (NeurIPS 2019)
AutoVC: AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss (2019)
F0-AutoVC: F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder (ICASSP 2020)
One-Shot Voice Conversion by Vector Quantization (ICASSP 2020)
SpeechSplit (auto-encoder): Unsupervised Speech Decomposition via Triple Information Bottleneck (ICML 2020)
NANSY: Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations (NeurIPS 2021)

GAN Based

CycleGAN-VC V1: Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks (2017)
StarGAN-VC: StarGAN-VC: non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks (2018)
CycleGAN-VC V2: CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion (2019)
CycleGAN-VC V3: CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion (2020)
MaskCycleGAN-VC: MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames (ICASSP 2021)

Singing

Singing Voice Synthesis

XiaoIce Band: XiaoIce Band: A Melody and Arrangement Generation Framework for Pop Music (KDD 2018)
Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
ByteSing: ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders (2020)
JukeBox: Jukebox: A Generative Model for Music (2020)
XiaoIce Sing: XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System (2020)
HiFiSinger: HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis (2020)
Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss (2020)
Learn2Sing: Learn2Sing: Target Speaker Singing Voice Synthesis by learning from a Singing Teacher (2020-11)
MusicBERT: MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training (ACL 2021)
SingGAN (Singing Voice Vocoder): SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation (AAAI 2022)
Background music generation: Video Background Music Generation with Controllable Music Transformer (ACM Multimedia 2021)
Multi-Singer (Singing Voice Vocoder): Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus (ACM Multimedia 2021)
Rapping-singing voice synthesis: Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control (SSW 11)
VISinger (VITS for Singing Voice Synthesis): VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis (2021-10)
Opencpop: Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis (2022-01)
Learning the Beauty in Songs: Neural Singing Voice Beautifier (ACL 2022)
Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher (2022-03)
MusicLM: MusicLM: Generating Music From Text (2023-01)
SingSong: SingSong: Generating musical accompaniments from singing (2023-01)

Singing Voice Conversion

A Universal Music Translation Network (2018)
Unsupervised Singing Voice Conversion (Interspeech 2019)
PitchNet: PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network (ICASSP 2020)
DurIAN-SC: DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System (Interspeech 2020)
Speech-to-Singing Conversion based on Boundary Equilibrium GAN (Interspeech 2020)
PPG-based singing voice conversion with adversarial representation learning (2020)