Welcome to the code repository for Kotoba-Speech v0.1, a 1.2B-parameter Transformer-based generative model for fluent Japanese speech. It is among the most capable open-source models currently available for Japanese text-to-speech.
Questions, feature requests, or bug reports? Join our Discord community!
Kotoba-Speech v0.1 is an open-source model that generates high-quality Japanese speech from text prompts and supports voice cloning from short speech prompts.
- Demo: Experience Kotoba-Speech in action here.
- Model Checkpoint: Access our commercially usable pre-trained model here.
- Open-sourced Code: This repository open-sources the training and inference code, along with the Gradio demo code. We borrow code from MetaVoice as a starting point.
Demo videos: `kotoba-speech_demo.mov`, `kansai_demo.mov`
- Installation
- Preparing Datasets
- Training
- Inference
- Other Notes
# Installing ffmpeg
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz.md5
md5sum -c ffmpeg-git-amd64-static.tar.xz.md5
tar xvf ffmpeg-git-amd64-static.tar.xz
sudo mv ffmpeg-git-*-static/ffprobe ffmpeg-git-*-static/ffmpeg /usr/local/bin/
rm -rf ffmpeg-git-*
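If you want to confirm the binaries landed on your PATH before proceeding, a minimal Python check (a hypothetical helper, not part of this repository) looks like this:

# Sanity check that ffmpeg/ffprobe are discoverable on PATH (hypothetical helper)
import shutil
import subprocess

for tool in ("ffmpeg", "ffprobe"):
    path = shutil.which(tool)
    if path is None:
        raise SystemExit(f"{tool} not found on PATH -- re-run the install steps above")
    # Print the first line of `<tool> -version` as a quick smoke test.
    version = subprocess.run([tool, "-version"], capture_output=True, text=True)
    print(path, "->", version.stdout.splitlines()[0])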
# Setting-up Python virtual environment
python -m venv myenv
source myenv/bin/activate
pip install -U --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip install -r requirements.txt
pip install flash-attn==2.5.3
pip install -e .
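After installation, you can verify that PyTorch sees your GPU and that flash-attn imports cleanly. This is a minimal sketch, assuming a CUDA-capable machine; it is not part of the repository:

# Environment sanity check (minimal sketch; assumes a CUDA-capable GPU)
import torch
import flash_attn

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)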
We provide an example of preparing a training dataset using Reazon Speech, the largest open-source Japanese speech dataset. (Note that our model is not necessarily trained solely on Reazon Speech.)
# Download & Format Data
python preprocess/download_reazon.py
# Pre-calculate Speaker Embeddings
python preprocess/spk_embed.py
# Tokenize Audio
python preprocess/audio_tokenize.py
# Tokenize Text Captions
python preprocess/text_tokenize.py
# Split data into (training/validation/test)
python preprocess/split.py
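The five steps above can also be chained from a single driver script. The sketch below simply shells out to each script in order; the script names come from this repository, but the wrapper itself is hypothetical:

# run_preprocess.py -- hypothetical wrapper that runs the preprocessing steps in order
import subprocess
import sys

STEPS = [
    "preprocess/download_reazon.py",   # download & format data
    "preprocess/spk_embed.py",         # pre-calculate speaker embeddings
    "preprocess/audio_tokenize.py",    # tokenize audio
    "preprocess/text_tokenize.py",     # tokenize text captions
    "preprocess/split.py",             # split into train/validation/test
]

for step in STEPS:
    print(f"==> {step}")
    subprocess.run([sys.executable, step], check=True)  # abort on the first failure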
# Fine-tuning from our pre-trained checkpoint
# Replace YOUR_WANDB_ENTITY and YOUR_WANDB_PROJECT
python fam/llm/train.py --num_gpus 1 --batch_size 32 --per_gpu_batchsize 2 --max_epoch 5 --learning_rate 0.00005 --data_dir data --exp_name reazon_small_exp_finetuning --spkemb_dropout 0.1 --check_val_every_n_epoch 1 --wandb_entity YOUR_WANDB_ENTITY --wandb_project YOUR_WANDB_PROJECT --use_wandb
# Multi-GPU Fine-tuning (e.g., using 2 GPUs)
# Replace YOUR_WANDB_ENTITY and YOUR_WANDB_PROJECT
python fam/llm/train.py --num_gpus 2 --batch_size 32 --per_gpu_batchsize 2 --max_epoch 5 --learning_rate 0.00005 --data_dir data --exp_name reazon_small_exp_finetuning --spkemb_dropout 0.1 --check_val_every_n_epoch 1 --wandb_entity YOUR_WANDB_ENTITY --wandb_project YOUR_WANDB_PROJECT --use_wandb
# Fine-tuning (without WandB logging)
python fam/llm/train.py --num_gpus 1 --batch_size 32 --per_gpu_batchsize 2 --max_epoch 5 --learning_rate 0.00005 --data_dir data --exp_name reazon_small_exp_finetuning --spkemb_dropout 0.1 --check_val_every_n_epoch 1
# Training from scratch
# Replace YOUR_WANDB_ENTITY and YOUR_WANDB_PROJECT
python fam/llm/train.py --num_gpus 1 --batch_size 64 --per_gpu_batchsize 2 --max_epoch 20 --learning_rate 0.0001 --data_dir data --exp_name reazon_small_exp --spkemb_dropout 0.1 --check_val_every_n_epoch 1 --wandb_entity YOUR_WANDB_ENTITY --wandb_project YOUR_WANDB_PROJECT --use_wandb --train_from_scratch
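In the commands above, `--batch_size` is the effective (global) batch size while `--per_gpu_batchsize` is the micro-batch that fits on one device; we assume the trainer bridges the gap with gradient accumulation, as is common in this kind of setup (check `fam/llm/train.py` for the exact behavior). Under that assumption, the accumulation factor works out as follows:

# Effective-batch arithmetic (our assumption about how train.py interprets these flags)
batch_size = 32        # --batch_size: target effective batch size
per_gpu_batchsize = 2  # --per_gpu_batchsize: micro-batch that fits on one GPU
num_gpus = 2           # --num_gpus

grad_accum_steps = batch_size // (per_gpu_batchsize * num_gpus)
print(grad_accum_steps)  # -> 8: each optimizer step accumulates 8 micro-batches per GPU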
# Inference from Our Pre-trained Checkpoint
python -i fam/llm/fast_inference.py --model_name kotoba-tech/kotoba-speech-v0.1
tts.synthesise(text="コトバテクノロジーズのミッションは音声基盤モデルを作る事です。", spk_ref_path="assets/bria.mp3")
# Inference from Our Pre-trained Checkpoint (Kansai dialect)
python -i fam/llm/fast_inference.py --model_name kotoba-tech/kotoba-speech-v0.1-kansai
tts.synthesise(text="コトバテクノロジーズのミッションは音声基盤モデルを作る事です。", spk_ref_path="assets/bria.mp3")
# Inference from Your Own Pre-trained Checkpoint
# YOUR_CHECKPOINT_PATH is something like /home/checkpoints/epoch=0-step=1810.ckpt
python -i fam/llm/fast_inference.py --first_model_path YOUR_CHECKPOINT_PATH
tts.synthesise(text="コトバテクノロジーズのミッションは音声基盤モデルを作る事です。", spk_ref_path="assets/bria.mp3")
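For non-interactive use, the same `TTS` class can be driven from a plain script. The sketch below assumes the MetaVoice-style interface this repository inherits, where `TTS(model_name=...)` loads the model and `synthesise(...)` writes a .wav file and returns its path; treat the exact signature as an assumption and check `fam/llm/fast_inference.py`:

# batch_synthesise.py -- minimal sketch; assumes the MetaVoice-style TTS interface
from fam.llm.fast_inference import TTS

tts = TTS(model_name="kotoba-tech/kotoba-speech-v0.1")
for text in [
    "コトバテクノロジーズのミッションは音声基盤モデルを作る事です。",
    "今日はいい天気ですね。",  # example sentence: "Nice weather today, isn't it?"
]:
    wav_path = tts.synthesise(text=text, spk_ref_path="assets/bria.mp3")
    print("wrote", wav_path)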
- See all active issues!
- Add documentation for multi-node training
- Integrate a Gradio demo
We thank MetaVoice for releasing their code and their English pre-trained model.