Unofficial PyTorch implementation of SoundStream, with training code and a 16 kHz pretrained checkpoint.
The 16 kHz pretrained model was trained on LibriSpeech train-clean-100 on a single NVIDIA T4 GPU for about 150 epochs (around 50 hours in total). The model is not causal.
import torchaudio
import torch

model = torch.hub.load("kaiidams/soundstream-pytorch", "soundstream_16khz")
x, sr = torchaudio.load('input.wav')
x, sr = torchaudio.functional.resample(x, sr, 16000), 16000
with torch.no_grad():
    y = model.encode(x)
    # y = y[:, :, :4]  # if you want to reduce code size.
    z = model.decode(y)
torchaudio.save('output.wav', z, sr)
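
The commented-out slice `y[:, :, :4]` above keeps only the first 4 entries along the last axis of `y`. Assuming that axis indexes residual quantizer stages, as in the SoundStream design, dropping stages lowers the bitrate in proportion. The sketch below estimates that bitrate. The helper `bitrate_bps` and the values `HOP_LENGTH = 320` and `CODEBOOK_SIZE = 1024` are illustrative assumptions, not values read from the checkpoint, so substitute the model's actual configuration.

```python
import math


def bitrate_bps(sample_rate, hop_length, num_quantizers, codebook_size):
    """Estimate bits per second of a residual-VQ code stream.

    Each frame carries one code index per quantizer stage, and each
    index costs log2(codebook_size) bits.
    """
    frames_per_second = sample_rate / hop_length
    bits_per_code = math.log2(codebook_size)
    return frames_per_second * num_quantizers * bits_per_code


SAMPLE_RATE = 16000    # matches the resampling step in the usage snippet
HOP_LENGTH = 320       # assumed encoder downsampling factor (hypothetical)
CODEBOOK_SIZE = 1024   # assumed entries per codebook (hypothetical)

for num_quantizers in (8, 4):
    bps = bitrate_bps(SAMPLE_RATE, HOP_LENGTH, num_quantizers, CODEBOOK_SIZE)
    print(f"{num_quantizers} quantizers: {bps / 1000:.1f} kbps")
```

Under these assumed values, halving the number of quantizer stages halves the bitrate, at the cost of reconstruction quality.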
Audio references are sampled from LibriSpeech test-clean.
Reference | SoundStream |
---|---|
audio link | audio link |
audio link | audio link |
audio link | audio link |