
Vocos

ColabBadge PaperBadge

Clone of the official Vocos, a frame-level vocoder based on a Fourier basis.

Demo

official demo

Usage

Install

# pip install "torch==2.0.0" -q      # Based on your environment (validated with vX.YZ)
# pip install "torchaudio==2.0.1" -q # Based on your environment
pip install git+https://github.com/tarepan/vocos-official
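
A quick import check confirms that the package and its Torch dependencies are available (a minimal sanity check; nothing here is specific to this repository beyond the vocos package itself):

# Post-install sanity check
import torch
import torchaudio
from vocos import Vocos

print(torch.__version__, torchaudio.__version__)
print(Vocos.from_pretrained)  # loader used in the examples below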

Inference

Mel-to-wave resynthesis

import torchaudio
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
y_hat = vocos(y)  # copy-synthesis: extract mel features, then decode back to a waveform
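
The call above is copy-synthesis in one step. It can also be written as two explicit steps, which makes the "mel-to-wave" part visible: extract mel features, then decode them back to a waveform. This is a sketch; the feature_extractor / decode attribute names are taken from the upstream Vocos class and should be treated as an assumption here.

mel = vocos.feature_extractor(y)  # waveform -> mel-spectrogram features
y_hat = vocos.decode(mel)         # mel-spectrogram features -> waveform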

Reconstruct audio from EnCodec tokens

Additionally, you need to provide a bandwidth_id, which selects the bandwidth embedding by index from the list of supported bandwidths (in kbps): [1.5, 3.0, 6.0, 12.0].

vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

audio_tokens = torch.randint(low=0, high=1024, size=(8, 200))  # 8 codeboooks, 200 frames
features = vocos.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([2])  # 6 kbps

audio = vocos.decode(features, bandwidth_id=bandwidth_id)
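
Note that bandwidth_id is an index into the supported-bandwidth list, not the bandwidth value itself. A tiny lookup makes the mapping explicit (illustration only; SUPPORTED_BANDWIDTHS is a local helper, not part of the Vocos API):

import torch

SUPPORTED_BANDWIDTHS = [1.5, 3.0, 6.0, 12.0]  # kbps, as listed above
bandwidth_id = torch.tensor([SUPPORTED_BANDWIDTHS.index(6.0)])  # -> tensor([2]), i.e. 6 kbps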

Copy-synthesis from a file: the snippet below extracts and quantizes features with EnCodec, then reconstructs them with Vocos in a single forward pass.

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)

y_hat = vocos(y, bandwidth_id=bandwidth_id)  # bandwidth_id as defined in the snippet above (6 kbps)
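
To listen to the result, the reconstructed waveform can be written back to disk; the pretrained charactr/*-24khz models operate at 24 kHz. The output path below is just an example.

torchaudio.save("reconstruction.wav", y_hat, sample_rate=24000)  # y_hat: (channels, samples) at 24 kHz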

Integrate with 🐶 Bark text-to-audio model

See example notebook.

Pre-trained models

Improved versions (2500K steps)

Model Name                   | Dataset       | Training Iterations | Parameters
charactr/vocos-mel-24khz     | LibriTTS      | 2.5 M               | 13.5 M
charactr/vocos-encodec-24khz | DNS Challenge | 2.5 M               | 7.9 M

Train

Jump to ☞ ColabBadge, then Run. That's all!

Results

Sample

Demo

Performance

  • training
    • 9.2 [iter/sec] @ NVIDIA A100 on Paperspace Gradient Notebook (MatMul TF32 / Conv TF32 / AMP all enabled)
    • takes about 1.3 days for the whole training run
  • inference (a minimal timing sketch follows this list)
    • batch RTF 18.7 @ Google Colab CPU instance (≤2 cores, batch size 1, 5-sec utterance)
    • stream RTF 1.5 @ Google Colab CPU instance (≤2 cores, batch size 1, 10-msec hop)
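
For reference, a real-time factor of this kind can be estimated with a simple timing loop. The sketch below is an illustration only, not the script behind the numbers above; it assumes the mel model from the Usage section and a dummy 5-sec input.

# Illustrative RTF measurement (not the original benchmark script)
import time
import torch
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")
sr = 24000
y = torch.randn(1, 5 * sr)  # 5-sec dummy waveform, batch size 1

with torch.inference_mode():
    vocos(y)  # warm-up
    start = time.perf_counter()
    vocos(y)
    elapsed = time.perf_counter() - start

audio_seconds = y.size(-1) / sr
print(f"speed vs. real time: {audio_seconds / elapsed:.1f}x")  # higher is faster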

References

Original paper

PaperBadge

@misc{2306.00814,
  Author = {Hubert Siuzdak},
  Title = {Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  Year = {2023},
  Eprint = {arXiv:2306.00814},
}

Info from official Vocos

  • mrd_loss_coeff=1.0 might be better than default 0.1

Also, you might want to set the mrd_loss_coeff to 1.0 right from the start. In my experience, it does slow down the convergence in terms of the UTMOS score a bit, but it's key for reducing the buzziness in the audio output.
issue#3

Acknowledgements