Clone of the official Vocos, a frame-level neural vocoder based on a Fourier basis.
```bash
# pip install "torch==2.0.0" -q      # Based on your environment (validated with vX.YZ)
# pip install "torchaudio==2.0.1" -q # Based on your environment
pip install git+https://github.com/tarepan/vocos-official
```
```python
import torchaudio
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
y_hat = vocos(y)
```
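If you already have mel features, you can decode them directly without going through a waveform. A minimal sketch, assuming the `decode` API of the upstream Vocos package and its (batch, n_mels=100, frames) feature layout:

```python
import torch
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

mel = torch.randn(1, 100, 256)  # (batch, n_mels, frames); random stand-in features
audio = vocos.decode(mel)       # reconstructed waveform at 24 kHz
```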
For the EnCodec-based model, you additionally need to provide a `bandwidth_id`, which corresponds to the embedding for one of the bandwidths in the list `[1.5, 3.0, 6.0, 12.0]` (kbps).
vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
audio_tokens = torch.randint(low=0, high=1024, size=(8, 200)) # 8 codeboooks, 200 frames
features = vocos.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([2]) # 6 kbps
audio = vocos.decode(features, bandwidth_id=bandwidth_id)
Copy-synthesis from a file: the forward pass extracts and quantizes features with EnCodec, then reconstructs them with Vocos in a single pass.
```python
y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
y_hat = vocos(y, bandwidth_id=bandwidth_id)
```
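To inspect the reconstruction, the tensor can be written back to disk with `torchaudio`; the output filename here is just an illustrative choice:

```python
# y_hat is a (channels, samples) tensor at 24 kHz after reconstruction
torchaudio.save("reconstruction.wav", y_hat, 24000)
```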
Integrate with the 🐶 Bark text-to-audio model: see the example notebook.
Improved versions (2.5M steps):

| Model Name | Dataset | Training Iterations | Parameters |
|---|---|---|---|
| charactr/vocos-mel-24khz | LibriTTS | 2.5M | 13.5M |
| charactr/vocos-encodec-24khz | DNS Challenge | 2.5M | 7.9M |
Jump to ☞ , then Run. That's all!
- training
  - 9.2 [iter/sec] @ NVIDIA A100 on Paperspace Gradient Notebook (MatMulTF32+/ConvTF32+/AMP+)
  - takes about 1.3 days for the whole training
- inference (see the RTF measurement sketch below)
  - batch RTF 18.7 @ Google Colab CPU instance (<=2 cores, nb=1/L=5sec)
  - stream RTF 1.5 @ Google Colab CPU instance (<=2 cores, nb=1/hop=10msec)
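Since the values above are greater than 1, RTF here presumably means audio duration divided by wall-clock processing time (higher is faster). A minimal sketch of measuring it that way, using a 5-second random input to mirror the L=5sec batch setting:

```python
import time

import torch
from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

sr = 24000
y = torch.randn(1, 5 * sr)  # 5 seconds of noise as a stand-in waveform

with torch.inference_mode():
    start = time.perf_counter()
    y_hat = vocos(y)
    elapsed = time.perf_counter() - start

audio_seconds = y.size(-1) / sr
print(f"RTF (audio sec / wall-clock sec): {audio_seconds / elapsed:.1f}")
```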
```bibtex
@misc{2306.00814,
  author = {Hubert Siuzdak},
  title  = {Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  year   = {2023},
  eprint = {arXiv:2306.00814},
}
```
`mrd_loss_coeff=1.0` might be better than the default `0.1` (the default is set in the training config, `configs/vocos.yaml` in the official repo). As noted in issue#3:

> Also, you might want to set the mrd_loss_coeff to 1.0 right from the start. In my experience, it does slow down the convergence in terms of the UTMOS score a bit, but it's key for reducing the buzziness in the audio output.
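A sketch of applying that override programmatically, assuming the training config follows the official repo's `configs/vocos.yaml` layout with `mrd_loss_coeff` under `model.init_args` (worth double-checking in your checkout):

```python
# Write a copy of the training config with a raised MRD loss weight.
import yaml  # pip install pyyaml

with open("configs/vocos.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["model"]["init_args"]["mrd_loss_coeff"] = 1.0  # default is 0.1

with open("configs/vocos_mrd1.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```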