heatz123/naturalspeech

Inference code

Murats7 opened this issue · 6 comments

Hi, can you provide a script for inference?

Sure. Here's an inference script that takes some text as input and writes result.wav.

from models.models import (
    SynthesizerTrn,
)

from text.symbols import symbols

from utils import utils
from text import text_to_sequence, cleaned_text_to_sequence

from utils import commons
import torch
import scipy.io.wavfile


def get_text(text, hps):
    # Convert raw text to a sequence of symbol IDs using the configured cleaners.
    text_norm = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        # Intersperse blank tokens (ID 0) between symbols, as done during training.
        text_norm = commons.intersperse(text_norm, 0)
    text_norm = torch.LongTensor(text_norm)
    return text_norm


hps = utils.get_hparams_from_file('configs/ljs.json')
model_path = './G_3190.pth'  # change this to your checkpoint path
text = 'your input text goes here'  # and this to your input text

# Build the synthesizer on GPU 0 and attach its memory bank.
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    hps.models,
).cuda(0)

net_g.attach_memory_bank(hps.models)

# Load the generator weights; the optimizer state is not needed for inference.
_, _, _, epoch_str = utils.load_checkpoint(
    model_path, net_g, None
)

net_g.eval()

# Prepare a batch of one input sequence and its length.
x = get_text(text, hps).cuda().unsqueeze(0)
x_lengths = torch.LongTensor([x.size(1)]).cuda()

with torch.no_grad():
    # noise_scale controls sampling variability; length_scale > 1 slows the speech.
    y_hat, mask, *_ = net_g.infer(x, x_lengths, noise_scale=0.667, length_scale=1.1, max_len=1200)
    audio = y_hat[0, 0, :].cpu().numpy()

scipy.io.wavfile.write(
    filename="result.wav",
    rate=hps.data.sampling_rate,
    data=audio,
)

Note that you have to change model_path and text in the script.
I'll upload the inference code to this repo, along with a pretrained model, in a few days.

Thank you for your work.

@heatz123 At inference time we want to synthesize long texts, but max_len=1000 in the learnable upsampling function.
If I use more than 1000, does training get better or not? And what about inference?

@yiwei0730, the impact of increasing the max_len parameter beyond 1000 frames depends on the dataset you are using. For instance, the LJSpeech dataset does not contain any samples longer than 1000 frames (equivalent to 1000 * 256 / 22050 ≈ 11.6 seconds at hop length 256). The primary reason for setting a max_len parameter is to prevent out-of-memory errors. If you want to generate longer sentences, you can set max_len to any value you need.
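
For example, with the script above you could simply raise the cap at inference time (2000 frames ≈ 23 seconds of audio at 22.05 kHz with hop length 256):

y_hat, mask, *_ = net_g.infer(x, x_lengths, noise_scale=0.667, length_scale=1.1, max_len=2000)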

To some extent, the model (the learnable upsampling module) will generalize to long sentences. However, generating excessively long sentences (such as >2000 frames) may result in reduced sample quality due to the potential mismatch with the training dataset, which only contains samples up to 1000 frames long.

If you intend to generate long texts with this model, I suggest breaking the input text into smaller sentences and performing inference on each of them, as in the sketch below. This approach aligns with the way the LJSpeech dataset is constructed and reduces the potential mismatch between the training and inference data.
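
Something like this rough sketch (reusing get_text, net_g, and hps from the script above; the regex splitter and the 0.2 s pause are just illustrative choices):

import re
import numpy as np

def synthesize_long(long_text, net_g, hps):
    # Split on sentence-ending punctuation; a real tokenizer would be more robust.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', long_text) if s.strip()]
    chunks = []
    for sent in sentences:
        x = get_text(sent, hps).cuda().unsqueeze(0)
        x_lengths = torch.LongTensor([x.size(1)]).cuda()
        with torch.no_grad():
            y_hat, *_ = net_g.infer(x, x_lengths, noise_scale=0.667, length_scale=1.1, max_len=1200)
        chunks.append(y_hat[0, 0, :].cpu().numpy())
        # Insert a short silence between sentences to avoid abrupt joins.
        chunks.append(np.zeros(int(0.2 * hps.data.sampling_rate), dtype=np.float32))
    return np.concatenate(chunks)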

I hope this helps!

@heatz123 Thanks for your reply, I see. I have a couple more questions about naturalspeech.
First question: Have you ever tried fine-tuning the model? If we train on LJSpeech for a week (1500 epochs), can I then fine-tune on another speaker for 200 epochs and get a good result?
Second question: Can I train a Mandarin model or a mixed-language model with naturalspeech? How many places would I need to change (preprocess_text, symbols, text.cleaners — is there anywhere I missed)?

@heatz123 @yiwei0730 I am unable to run inference using the above code; I keep getting a segmentation fault (core dumped). After some debugging, I found that it fails at the return statement in the part of api.py below, under the espeak lib:

def text_to_phonemes(self, text_ptr, text_mode, phonemes_mode):
    f_text_to_phonemes = self._library.espeak_TextToPhonemes
    f_text_to_phonemes.restype = ctypes.c_char_p
    f_text_to_phonemes.argtypes = [
        ctypes.POINTER(ctypes.c_char_p),
        ctypes.c_int,
        ctypes.c_int]
    return f_text_to_phonemes(text_ptr, text_mode, phonemes_mode)

Any help on this is appreciated. My espeak version is 1.47.11, my torch version is 2.0.1, and I am running on a CentOS 7 server.
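
If it helps to isolate the problem, a minimal check (assuming the text cleaners go through the phonemizer package's espeak backend, which is where this api.py lives) would be:

# If this also segfaults, the problem is in espeak/phonemizer itself,
# not in the naturalspeech inference script.
from phonemizer import phonemize
print(phonemize('Hello world.', language='en-us', backend='espeak', strip=True))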