Simplify-GPT-SoVITS

A project that simplifies GPT-SoVITS.


Simplified Voice-Clone



1. Introduction

This project strips down GPT-SoVITS, FishSpeech, and ChatTTS so that users can run simple model inference and training from Python code.

2. Installation

1. Create a virtual environment

    conda create -n gpt_sovits python=3.8
    conda activate gpt_sovits
2. Install torch

    pip install torch torchvision torchaudio
3. Install ffmpeg

    conda install ffmpeg
4. Clone the project and install its dependencies

    git clone https://github.com/HanxSmile/Simplify-GPT-SoVITS.git
    cd Simplify-GPT-SoVITS
    pip install .
5. Verify the installation

    python -c "from gpt_sovits import Factory"

3. Few-shot Model Inference

3.1 GPT-SoVITS

1. Download the pretrained models (see the original author's project, GPT-SoVITS)

    git lfs clone https://huggingface.co/lj1995/GPT-SoVITS
2. Download and unzip the Chinese g2p model

    wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/g2p/G2PWModel_1.1.zip
    unzip G2PWModel_1.1.zip -d ./
3. Edit the model config, filling the paths of the models downloaded above into the corresponding fields

    config/gpt_sovits.yaml:

    model_cls: gpt_sovits
    
    hubert_model_name: GPT-SoVITS/chinese-hubert-base
    bert_model_name: GPT-SoVITS/chinese-roberta-wwm-ext-large
    t2s_model_name: GPT-SoVITS/gsv-v2final-pretrained/s1bert25hz-5kh-longer-epoch=12-step=369668.ckpt
    vits_model_name: GPT-SoVITS/gsv-v2final-pretrained/s2G2333k.pth
    cut_method: cut6
    text_converter:
      converter_cls: chinese_converter
      g2p_model_dir: G2PWModel_1.1
      g2p_tokenizer_dir: GPT-SoVITS/chinese-roberta-wwm-ext-large
    
    generate_cfg:
      placeholder: Null

    Fields that must be modified:

    Field                             Explanation
    hubert_model_name                 path to the hubert model
    bert_model_name                   path to the bert model
    t2s_model_name                    path to the AR model
    vits_model_name                   path to the vits model
    text_converter.g2p_model_dir      path to the g2p model
    text_converter.g2p_tokenizer_dir  g2p tokenizer directory (same as bert_model_name)

    Optional fields:

    Field       Explanation
    cut_method  how long text is cut into sentences (cut6 recommended: split on 「,。?!...」)
4. Collect a reference audio file and its transcript

5. Few-shot model inference

    from gpt_sovits import Factory
    from gpt_sovits.utils import save_audio
    import os
    import uuid
    
    # Build the model from the config edited in step 3.
    cfg = Factory.read_config("config/gpt_sovits.yaml")
    model = Factory.build_model(cfg)
    
    # Reference audio + its transcript, plus the target text to synthesize.
    inputs = {
        "prompt_audio": "examples/linghua_90.wav",
        "prompt_text": "藏明刀的刀工,也被算作是本領通神的神士相關人員,歸屬統籌文化、藝術、祭祀的射鳳形意派管理。",
        "text": "明月几时有,把酒问青天"
    }
    model = model.cuda()
    sr, audio_data = model.generate(inputs)
    
    # Write the result to <random hex>.wav in the current directory.
    name = uuid.uuid4().hex
    output_dir = os.getcwd()
    output_file = os.path.join(output_dir, name + '.wav')
    
    output_file = save_audio(audio_data, sr, output_file)
    print(output_file)
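
For intuition, the cut6 splitting recommended above (breaking long text at sentence punctuation before synthesis) can be sketched roughly as follows. This is an illustrative reimplementation with my own assumed punctuation set, not the library's actual cut6 code:

```python
import re

# Rough sketch of a cut6-style splitter (NOT the library's implementation):
# cut long text after each sentence-ending punctuation mark so the model
# synthesizes short chunks. The punctuation set here is an assumption.
PUNCT = ",。?!、;:…,.?!;:"

def cut6_like(text, punct=PUNCT):
    # Split on punctuation but keep the marks via a capture group.
    pieces = re.split("([" + re.escape(punct) + "])", text)
    out = []
    for seg in pieces:
        if not seg:
            continue
        if out and seg in punct:
            out[-1] += seg  # re-attach the mark to the preceding segment
        else:
            out.append(seg)
    # Drop segments that are only punctuation or whitespace.
    return [s for s in out if s.strip(punct + " ")]

print(cut6_like("明月几时有,把酒问青天。不知天上宫阙,今夕是何年?"))
# → ['明月几时有,', '把酒问青天。', '不知天上宫阙,', '今夕是何年?']
```

Splitting this way keeps each generation call short, which tends to make few-shot synthesis more stable on long passages.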

3.2 FishSpeech

1. Download the pretrained model (see the original author's project, FishSpeech)

    git lfs clone https://huggingface.co/fishaudio/fish-speech-1.4
2. Edit the model config, filling the path of the model downloaded above into the corresponding fields

    config/fishspeech.yaml:

    model_cls: fish_speech
    cut_method: cut6
    vqgan:
      model_cls: filefly_vqgan
      ckpt: fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth
      spec_transform:
        sample_rate: 44100
        n_mels: 160
        n_fft: 2048
        hop_length: 512
        win_length: 2048
      backbone:
        input_channels: 160
        depths: [ 3, 3, 9, 3 ]
        dims: [ 128, 256, 384, 512 ]
        drop_path_rate: 0.2
        kernel_size: 7
      head:
        hop_length: 512
        upsample_rates: [ 8, 8, 2, 2, 2 ]
        upsample_kernel_sizes: [ 16, 16, 4, 4, 4 ]
        resblock_kernel_sizes: [ 3, 7, 11 ]
        resblock_dilation_sizes: [ [ 1, 3, 5 ], [ 1, 3, 5 ], [ 1, 3, 5 ] ]
        num_mels: 512
        upsample_initial_channel: 512
        pre_conv_kernel_size: 13
        post_conv_kernel_size: 13
      quantizer:
        input_dim: 512
        n_groups: 8
        n_codebooks: 1
        levels: [ 8, 5, 5, 5 ]
        downsample_factor: [ 2, 2 ]
    
    text2semantic:
      model_cls: dual_ar_transformer
      tokenizer_name: fish-speech-1.4/
      ckpt: fish-speech-1.4/model.pth
      model:
        attention_qkv_bias: False
        codebook_size: 1024
        dim: 1024
        dropout: 0.1
        head_dim: 64
        initializer_range: 0.02
        intermediate_size: 4096
        max_seq_len: 4096
        n_fast_layer: 4
        n_head: 16
        n_layer: 24
        n_local_heads: 2
        norm_eps: 1e-6
        num_codebooks: 8
        rope_base: 1e6
        tie_word_embeddings: False
        use_gradient_checkpointing: True
        vocab_size: 32000
    
    text_converter:
      converter_cls: chinese_fs_converter

    Fields that must be modified:

    Field                         Explanation
    vqgan.ckpt                    path to the vqgan model
    text2semantic.ckpt            path to the text2semantic model
    text2semantic.tokenizer_name  directory of the tokenizer used by the text2semantic model

    Optional fields:

    Field       Explanation
    cut_method  how long text is cut into sentences (cut6 recommended: split on 「,。?!...」)
3. Collect a reference audio file and its transcript

4. Few-shot model inference

    from gpt_sovits import Factory
    from gpt_sovits.utils import save_audio
    import os
    import uuid
    
    cfg = Factory.read_config("config/fishspeech.yaml")
    model = Factory.build_model(cfg)
    
    inputs = {
        "prompt_audio": "examples/linghua_90.wav",
        "prompt_text": "藏明刀的刀工,也被算作是本領通神的神士相關人員,歸屬統籌文化、藝術、祭祀的射鳳形意派管理。",
        "text": "明月几时有,把酒问青天"
    }
    model = model.cuda()
    sr, audio_data = model.generate(inputs)
    
    name = uuid.uuid4().hex
    output_dir = os.getcwd()
    output_file = os.path.join(output_dir, name + '.wav')
    
    output_file = save_audio(audio_data, sr, output_file)
    print(output_file)
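
As a sanity check on the vqgan numbers above (my own observation, not stated in the original project): a hop_length of 512 at a 44100 Hz sample rate yields about 86 mel frames per second, and the quantizer's downsample_factor of [2, 2] divides that by 4, landing near 21.5 tokens per second — consistent with the "21hz" in the checkpoint file name firefly-gan-vq-fsq-8x1024-21hz-generator.pth:

```python
# Token rate implied by the vqgan config above (arithmetic only).
sample_rate = 44100
hop_length = 512
downsample_factor = [2, 2]

mel_fps = sample_rate / hop_length  # mel-spectrogram frames per second
token_rate = mel_fps
for f in downsample_factor:         # each factor halves the frame rate
    token_rate /= f

print(f"{mel_fps:.2f} mel frames/s -> {token_rate:.2f} tokens/s")
# → 86.13 mel frames/s -> 21.53 tokens/s
```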

4. Gradio Demo

step 1: Download the pretrained models (see above).

step 2: Prepare the config files, filling the pretrained model paths into the corresponding fields (see above), and place all config files under the project's config directory.

step 3: In the project directory, run: python webui.py

Todo List

  • Model inference:

    • GPT-SoVITS
    • FishSpeech
    • Chat-TTS
  • Model training

Reference Projects