This is the implementation for our paper submitted to ICASSP 2023: "CAN KNOWLEDGE OF END-TO-END TEXT-TO-SPEECH MODELS IMPROVE NEURAL MIDI-TO-AUDIO SYNTHESIS SYSTEMS?"
Xuan Shi, Erica Cooper, Xin Wang, Junichi Yamagishi, Shrikanth Narayanan
Audio samples are available at: https://nii-yamagishilab.github.io/sample-midi-to-audio/.
Please cite this paper if the idea, code, or pretrained model is helpful to your research.
The code for model training was based on the ESPnet-TTS project: Tomoki Hayashi, Ryuichi Yamamoto, Katsuki Inoue, Takenori Yoshimura, Shinji Watanabe, Tomoki Toda, Kazuya Takeda, Yu Zhang, and Xu Tan, "ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit," ICASSP 2020.
The data for all experiments (training and inference) is the MAESTRO dataset: Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck, "Enabling factorized piano music modeling and generation with the MAESTRO dataset," ICLR 2019.
This work consists of a MIDI-to-mel component based on Transformer-TTS (Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu, "Neural speech synthesis with transformer network," AAAI 2019) and a HiFi-GAN-based mel-to-audio component (Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis," NeurIPS 2020). The two components were first trained separately and then jointly fine-tuned for an additional 200K steps.
We recommend following the official ESPnet installation instructions to set up a complete ESPnet2 environment for model training.
- Set up Kaldi
$ cd <midi2wav-root>/tools
$ ln -s <kaldi-root> .
- Set up the Python environment. There are four setup methods; we strongly recommend the first one, shown below.
$ cd <midi2wav-root>/tools
$ CONDA_TOOLS_DIR=$(dirname ${CONDA_EXE})/..
$ ./setup_anaconda.sh ${CONDA_TOOLS_DIR} [conda-env-name] [python-version]
# e.g.
$ ./setup_anaconda.sh ${CONDA_TOOLS_DIR} midi2wav 3.9
- Install ESPnet
$ cd <midi2wav-root>/tools
$ make TH_VERSION=1.8 CUDA_VERSION=11.1
Make sure the installed ESPnet version is espnet==0.10.
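A quick way to check this from the activated midi2wav environment is, for example:
# Print the installed ESPnet version; it should be 0.10.x
$ python3 -c "import espnet; print(espnet.__version__)"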
After the midi2wav conda environment is set up, install additional packages for music processing and re-install some packages to avoid potential version conflicts.
Suggested dependencies:
pretty_midi==0.2.9
wandb==0.12.9
protobuf==3.19.3
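For example, these pinned versions can be installed in the activated midi2wav environment (a minimal sketch; adjust the pins if pip reports conflicts):
# Install the music-processing and logging dependencies with pinned versions
$ pip install pretty_midi==0.2.9 wandb==0.12.9 protobuf==3.19.3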
First, create the directory in which to save the pre-trained model.
$ cd <midi2wav-root>/egs2/maestro/tts1
$ mkdir -p exp/tts_finetune_joint_transformer_hifigan_raw_proll
$ cd exp/tts_finetune_joint_transformer_hifigan_raw_proll
Then, download the pre-trained model from Zenodo, rename it to train.loss.ave.pth, and save it under the directory created above.
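For example (a sketch only; <zenodo-model-url> is a placeholder for the actual Zenodo download link):
# Download the checkpoint and give it the filename expected by the recipe
$ wget -O train.loss.ave.pth <zenodo-model-url>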
The MIDI-to-wav scripts are developed based on the Kaldi-style ESPnet recipes. The main working directory is <midi2wav-root>/egs2/maestro/tts1.
There are 7 main stages:
- 1~4: Data Preparation
- 5: Stats Collection
- 6: Model Training
- 7: Model Inference
Experimental environment setup (data preparation and stats collection, stages 1-5):
$ ./run.sh --stage 1 --stop_stage 5 --ngpu ${num_gpu} --tts_task mta --train_config ./conf/train.yaml
Model training (Acoustic Model):
$ ./run.sh --stage 6 --stop_stage 6 --ngpu ${num_gpu} --tts_task mta --train_config ./conf/train.yaml
Model training (Synthesizer or Joint training):
$ ./run.sh --stage 6 --stop_stage 6 --ngpu ${num_gpu} --tts_task gan_mta --train_config ./conf/tuning/finetune_joint_transformer_hifigan.yaml
Model inference (Synthesizer or Joint training):
$ ./run.sh --stage 7 --stop_stage 7 --skip_data_prep true --ngpu ${num_gpu} --tts_task gan_mta --train_config ./conf/tuning/finetune_joint_transformer_hifigan.yaml
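Putting the stages together, a complete run might look like the following (a sketch of the order implied above, assuming a single GPU; num_gpu=1 is only an example value):
$ num_gpu=1
# Stages 1-5: data preparation and statistics collection
$ ./run.sh --stage 1 --stop_stage 5 --ngpu ${num_gpu} --tts_task mta --train_config ./conf/train.yaml
# Stage 6: train the acoustic model (MIDI-to-mel)
$ ./run.sh --stage 6 --stop_stage 6 --ngpu ${num_gpu} --tts_task mta --train_config ./conf/train.yaml
# Stage 6 again: joint fine-tuning with the HiFi-GAN-based synthesizer
$ ./run.sh --stage 6 --stop_stage 6 --ngpu ${num_gpu} --tts_task gan_mta --train_config ./conf/tuning/finetune_joint_transformer_hifigan.yaml
# Stage 7: inference with the jointly fine-tuned model
$ ./run.sh --stage 7 --stop_stage 7 --skip_data_prep true --ngpu ${num_gpu} --tts_task gan_mta --train_config ./conf/tuning/finetune_joint_transformer_hifigan.yaml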
This study was supported by the Japanese-French joint national project VoicePersonae, JST CREST (JPMJCR18A6, JPMJCR20D3), MEXT KAKENHI Grants (21K17775, 21H04906, 21K11951), Japan, and the Google AI for Japan program.
The code is licensed under the Apache License, Version 2.0, following ESPnet. The pretrained model is licensed under the Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/legalcode