Enhancing diffusion neural network for a single speaker speech based on Speech Flow architecture. Evaluation notebook.
Important
Network was trained using 5s intrevals, but it can work with any length of audio with slightly reduced quality.
- ⚡️ Restoring and improving audio
- 🎤 24khz mono audio
- 🚀 Can work directly with spectograms for speedup and tight pipelining
- 🤹♂️ Can work with unknown languages
enhance_demo.mp4
Supervoice Enhance consists of multiple networks, but they are all loaded using a single command and published using Torch Hub, so you can use it as follows:
import torch
import torchaudio
# Load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.hub.load(repo_or_dir='ex3ndr/supervoice-enhance', model='enhance', vocoder = True) # vocoder = False if you don't need vocoder
model.to(device)
model.eval()
# Load audio
def load_mono_audio(path):
audio, sr = torchaudio.load(path)
if sr != model.sample_rate:
audio = torchaudio.transforms.Resample(sr, model.sample_rate)(audio)
sr = model.sample_rate
if audio.shape[0] > 1:
audio = audio.mean(dim=0, keepdim=True)
return audio[0]
audio = load_mono_audio("./eval/eval_2.wav")
# Enhance
enhanced = model.enhance(waveform = audio, steps = 8) # 8 is optimal, 32 is higer quality but sometimes it halluciantes
enhanced_spec = model.enhance(waveform = audio, steps = 8, vocoder = False) # Return spectogram without running vocoder
MIT