Using so-vits-svc-fork in a Docker Environment
This project demonstrates how to use so-vits-svc-fork within a Docker environment. Before diving in, there are some essential materials you should prepare:
- Speaker’s Voice: This is used for SVC (Singing Voice Conversion) training. Ensure that the voice recording is clean and single-source.
- Author’s Voice: This is the recording you want to convert; during inference it is transformed into the Speaker’s voice.
Here are the voice sources I used for SVC training:
Additionally, if you need to convert MP3 files to WAV format, you can use this simple online converter:
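If you'd rather convert locally instead of using an online tool, ffmpeg (assuming it is installed) handles this in one line:

```bash
# convert an MP3 recording to WAV for training
ffmpeg -i input.mp3 output.wav
```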
- Install Docker: Make sure you have Docker installed on your computer.
- Check CPU Architecture: Note that the default image's PyTorch build does not support ARM. If you're using a Mac M1, remember to build the `svc-arm64` container instead.
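A quick way to check which architecture you are on:

```bash
# prints x86_64 on amd64 machines, arm64/aarch64 on Apple Silicon
uname -m
```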
To get started, let's organize the voice files:

- If you've just downloaded the Speaker's voice, move it to the `storage/dataset_raw_raw` folder. We'll split it into smaller pieces within the Docker container.
- If you already have a pre-processed Speaker's voice, move it to the `storage/dataset_raw` folder.
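As a concrete sketch (the file name here is hypothetical; substitute your own recording):

```bash
# place an unsplit download where the container expects raw input
mkdir -p storage/dataset_raw_raw
mv ~/Downloads/speaker_voice.mp3 storage/dataset_raw_raw/
```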
- Build the Docker image:

```bash
cd hermeslin/so-vits-svc-tutorial
## if your CPU architecture is arm64 (e.g., Mac M1)
docker compose build svc-arm64
## or just run
docker compose build svc
```
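If you prefer not to pick the service by hand, a small shell sketch can switch on the host architecture (same two build commands as above):

```bash
# build the matching compose service for the current CPU architecture
case "$(uname -m)" in
  arm64|aarch64) docker compose build svc-arm64 ;;
  *)             docker compose build svc ;;
esac
```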
- Test the command. You should see the result below after running:

```bash
docker compose run svc-arm64 --help
```
Result:

```
Usage: svc [OPTIONS] COMMAND [ARGS]...

  so-vits-svc allows any folder structure for training data.
  However, the following folder structure is recommended.
      When training: dataset_raw/{speaker_name}/**/{wav_name}.{any_format}
      When inference: configs/44k/config.json, logs/44k/G_XXXX.pth
  If the folder structure is followed, you DO NOT NEED TO SPECIFY model path,
  config path, etc. (The latest model will be automatically loaded.)
  To train a model, run pre-resample, pre-config, pre-hubert, train.
  To infer a model, run infer.

Options:
  -h, --help  Show this message and exit.

Commands:
  clean          Clean up files, only useful if you are using the default file structure
  gui            Opens GUI for conversion and realtime inference
  infer          Inference
  onnx           Export model to onnx (currently not working)
  pre-classify   Classify multiple audio files into multiple files
  pre-config     Preprocessing part 2: config
  pre-hubert     Preprocessing part 3: hubert If the HuBERT model is not found, it will be...
  pre-resample   Preprocessing part 1: resample
  pre-sd         Speech diarization using pyannote.audio
  pre-split      Split audio files into multiple files
  train          Train model If D_0.pth or G_0.pth not found, automatically download from hub.
  train-cluster  Train k-means clustering
  vc             Realtime inference from microphone
```
```bash
## or, if you want to log in to the container
docker compose run --rm --entrypoint /bin/bash svc-arm64
```
- If your Speaker’s voice audio file is too large for training, consider splitting it into smaller pieces. You can choose 10 to 20 representative pieces for training.
```bash
docker compose run svc-arm64 pre-split -o dataset_raw/Trump
```
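To spot-check the split files from the host (assuming the compose file mounts `./storage` into the container, matching the folder layout above):

```bash
# list a few of the generated pieces
ls storage/dataset_raw/Trump | head
```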
- Pre-processing the Raw Dataset:
  - Resample the audio files:

    ```bash
    docker compose run svc-arm64 pre-resample
    ```
  - Configure the dataset:

    ```bash
    docker compose run svc-arm64 pre-config
    ```
  - Choose an F0 method (crepe, crepe-tiny, parselmouth, dio, or harvest). By default, svc uses `dio`:

    ```bash
    docker compose run svc-arm64 pre-hubert -fm crepe
    ```
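The three preprocessing steps can also be chained so a failure stops the pipeline early:

```bash
# same commands as above, run back to back
docker compose run svc-arm64 pre-resample && \
docker compose run svc-arm64 pre-config && \
docker compose run svc-arm64 pre-hubert -fm crepe
```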
- Start Training. Training will take several minutes:

```bash
docker compose run svc-arm64 train -t
```
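Per the help text above, checkpoints land under `logs/44k/`, and `infer` automatically loads the latest one. Assuming `./storage` is the mounted volume, you can watch them appear from the host:

```bash
# G_*.pth (generator) and D_*.pth (discriminator) checkpoints accumulate here
ls storage/logs/44k/
```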
- Convert the Author's Voice into the Speaker's Voice:

```bash
docker compose run svc-arm64 infer audio/{YOUR_AUTHORS_VOICE_FILE_NAME}.wav
```
- View the Result:
  - The converted audio (your recording rendered in the Speaker's voice) will appear in the audio folder as `audio/{YOUR_AUTHORS_VOICE_FILE_NAME}.out.wav`.
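To audition the result quickly from the host (assuming ffmpeg's ffplay is installed and `./storage` is the mounted volume):

```bash
# play the converted file and exit when it finishes
ffplay -autoexit storage/audio/{YOUR_AUTHORS_VOICE_FILE_NAME}.out.wav
```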