SpeechInterface

Python 3.6

A Speech Interface Toolkit for Neural Speech Synthesis with PyTorch

This repository helps you deploy neural speech synthesis experiments efficiently. Its main features are:

  • Provides matching audio feature parameters and source code for using major neural vocoders.

  • Each such pairing is called an interface, which has an encode and a decode function.

    • Encode: Convert a raw waveform to audio features (e.g. mel-spectrogram, MFCC).

    • Decode: Reconstruct a raw waveform from audio features (i.e. run the neural vocoder).

  • Usage Examples
    • Compare the experimental results of one neural vocoder with those of others
    • Use the audio features and neural vocoders directly in neural speech synthesis models
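
The comparison use case can be sketched as a small harness that passes the same waveform through several interfaces and scores each reconstruction. This is a minimal sketch, not part of speech_interface: `compare_round_trips` and `mean_abs_error` are hypothetical names, the only assumption about an interface is that it exposes the encode and decode functions described above, and waveforms are treated as flat sequences of floats (a tensor would first need converting, e.g. with `.flatten().tolist()`). Sample-wise error is only a crude placeholder, since a vocoder's output need not align sample-for-sample with its input; swap in whatever metric your experiment uses.

```python
from typing import Callable, Dict, Sequence


def mean_abs_error(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Mean absolute difference over the overlapping samples (placeholder metric)."""
    n = min(len(reference), len(estimate))
    if n == 0:
        raise ValueError("both waveforms must contain at least one sample")
    return sum(abs(r - e) for r, e in zip(reference[:n], estimate[:n])) / n


def compare_round_trips(
    interfaces: Dict[str, object],
    wav: Sequence[float],
    metric: Callable[[Sequence[float], Sequence[float]], float] = mean_abs_error,
) -> Dict[str, float]:
    """Encode and decode `wav` with every interface and score each reconstruction.

    `interfaces` maps a display name to any object with `encode` and `decode`
    methods; nothing else about the object is assumed.
    """
    scores = {}
    for name, interface in interfaces.items():
        features = interface.encode(wav)            # waveform -> audio features
        reconstructed = interface.decode(features)  # audio features -> waveform
        scores[name] = metric(wav, reconstructed)
    return scores
```

Passing the metric as an argument keeps the harness independent of any one evaluation method, so the same loop can serve both quick sanity checks and a fuller evaluation.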

Install

$ pip install speech_interface

Available neural vocoders

  1. HiFi-GAN (Universal v1, VCTK, LJSpeech): speech_interface.interfaces.hifi_gan.InterfaceHifiGAN
  2. MelGAN (multi-speaker and LJSpeech, from the official repository): speech_interface.interfaces.mel_gan.InterfaceMelGAN
  3. WaveGlow (LJSpeech; Universal will be added once an import error is resolved): speech_interface.interfaces.waveglow.InterfaceWaveGlow
  4. Multi-band MelGAN (VCTK, LJSpeech): speech_interface.interfaces.multiband_mel_gan.InterfaceMultibandMelGAN
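
Because each vocoder is listed with a dotted class path, an interface class can be picked by name at run time. The following is a minimal sketch under that assumption: the `VOCODER_PATHS` keys and the `import_by_path` and `load_interface_class` helpers are hypothetical and not part of speech_interface; only the dotted paths are taken from the list above.

```python
import importlib

# Dotted paths copied from the list above; the short keys are made up for this sketch.
VOCODER_PATHS = {
    "hifi_gan": "speech_interface.interfaces.hifi_gan.InterfaceHifiGAN",
    "mel_gan": "speech_interface.interfaces.mel_gan.InterfaceMelGAN",
    "waveglow": "speech_interface.interfaces.waveglow.InterfaceWaveGlow",
    "multiband_mel_gan": "speech_interface.interfaces.multiband_mel_gan.InterfaceMultibandMelGAN",
}


def import_by_path(dotted_path: str):
    """Import and return the attribute named by a 'package.module.Name' path."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.Name', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def load_interface_class(name: str):
    """Look up a vocoder by its short key and import the matching interface class."""
    try:
        dotted_path = VOCODER_PATHS[name]
    except KeyError:
        raise KeyError(
            f"unknown vocoder {name!r}; choose from {sorted(VOCODER_PATHS)}"
        ) from None
    return import_by_path(dotted_path)
```

With speech_interface installed, `load_interface_class("hifi_gan")` would be expected to return the `InterfaceHifiGAN` class; the import happens only when a name is requested, so unused vocoders are never loaded.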

Example

  • Use an interface
import librosa
import torch
from speech_interface.interfaces.hifi_gan import InterfaceHifiGAN

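# Make an interface; fall back to the CPU when CUDA is unavailable
model_name = 'hifi_gan_v1_universal'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
interface = InterfaceHifiGAN(model_name=model_name, device=device)

# Load the waveform. librosa resamples to 22050 Hz by default;
# pass sr=... if the model's audio parameters require another rate.
wav, sr = librosa.load('/your/wav/form/file/path')

# Convert to a PyTorch tensor with a leading batch dimension
wav_tensor = torch.from_numpy(wav).unsqueeze(0)  # (1, Tw)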

# encode waveform tensor
features = interface.encode(wav_tensor)

# your speech synthesis process ...
# ...

# reconstruct waveform
pred_wav_tensor = interface.decode(features)
  • Check the available models and audio parameters
from speech_interface.interfaces.hifi_gan import InterfaceHifiGAN

# available models
print(InterfaceHifiGAN.available_models())

# audio parameters
print(InterfaceHifiGAN.audio_params())

License

This repository is released under the MIT license.