After reconstructing audio using only the VAE, I get only noise
SuperiorDtj opened this issue · 5 comments
Here is my code. Is there something wrong with the way I am using the VAE?
```python
def recon_vae(self, filename):
    """ recon audio only by vae """
    with torch.no_grad():
        waveform, sample_rate = torchaudio.load(filename)
        waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)[0]
        waveform = waveform - torch.mean(waveform)
        waveform = waveform / (torch.max(torch.abs(waveform)) + 1e-8)
        waveform = 0.5 * waveform
        waveform = waveform / torch.max(torch.abs(waveform))
        waveform = 0.5 * waveform
        #waveform = 0.5 * waveform[0:int(len(waveform)*1)]
        audio = torch.unsqueeze(waveform, 0)
        audio = torch.nan_to_num(torch.clip(audio, -1, 1))
        audio = torch.autograd.Variable(audio, requires_grad=False)
        melspec, log_magnitudes_stft, energy = self.stft.mel_spectrogram(audio)
        melspec = melspec.transpose(1, 2)
        melspec = melspec.unsqueeze(1)
        truth_lattent = self.vae.get_first_stage_encoding(self.vae.encode_first_stage(melspec))
        mel_recon = self.vae.decode_first_stage(truth_lattent)
        wave = self.vae.decode_to_waveform(mel_recon)
        return wave[0], waveform
```
Can you try the following:
```python
import torch
import torchaudio
from tango import Tango
from tools.torch_tools import wav_to_fbank

filename = ...
device = "cuda:0"

tango = Tango("declare-lab/tango", device)
tango.vae.eval()
tango.stft.eval()

duration = 10
target_length = int(duration * 102.4)

with torch.no_grad():
    mel, _, waveform = wav_to_fbank([filename], target_length, tango.stft)
    mel = mel.unsqueeze(1).to(device)
    latent = tango.vae.get_first_stage_encoding(tango.vae.encode_first_stage(mel))
    reconstructed_mel = tango.vae.decode_first_stage(latent)
    reconstructed_waveform = tango.vae.decode_to_waveform(reconstructed_mel)[0]
```
Thanks for your code! I can now reconstruct the audio, but only when the number of frames in the audio is a multiple of four (for example, a 3.6 s clip works but a 3.7 s clip does not).
Is this a common issue with the VAE model?
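The arithmetic behind this observation can be checked directly. Below is a minimal sketch that reuses the `102.4` frames-per-second factor from the suggested snippet. It assumes, based only on the behaviour reported here, that the VAE needs a frame count divisible by 4. `frames_for` and `REQUIRED_MULTIPLE` are hypothetical names and not part of Tango.

```python
FRAMES_PER_SECOND = 102.4  # factor used in the suggested snippet
REQUIRED_MULTIPLE = 4      # assumption inferred from the reported behaviour


def frames_for(duration_s: float) -> int:
    """Number of mel frames requested for a clip of the given duration."""
    return int(duration_s * FRAMES_PER_SECOND)


for duration_s in (3.6, 3.7):
    n_frames = frames_for(duration_s)
    divisible = n_frames % REQUIRED_MULTIPLE == 0
    print(duration_s, n_frames, divisible)
```

Under these assumptions, 3.6 s maps to 368 frames, which is divisible by 4, while 3.7 s maps to 378 frames, which is not. This is consistent with the 3.6 s clip working and the 3.7 s clip failing.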
What is the exact issue when reconstructing a 3.7s audio? Does it generate noise for the entire 3.7s or the last 0.1s?
When the VAE reconstructs a 3.7 s audio clip, it generates noise for the entire 3.7 s.
I have run into the same problem. Has it been solved? I tried reconstructing the same audio sample several times, and the result is always noise. The results also vary greatly from one reconstruction to the next.
Is the only solution to set the duration like this?
target_length = int(duration * 102.4)
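One possible alternative to restricting the clip duration, sketched under the assumption that the frame count must be a multiple of 4: round `target_length` up to the next multiple, then trim the reconstructed waveform back to the original length. `padded_target_length` and `samples_to_keep` are hypothetical helpers, the 16 kHz sample rate is taken from the first post, and none of this has been verified against Tango itself.

```python
import math

FRAMES_PER_SECOND = 102.4  # factor from the snippet in this thread
SAMPLE_RATE = 16000        # resampling rate used in the first post


def padded_target_length(duration_s: float, multiple: int = 4) -> int:
    """Frame count for the clip, rounded up to the next multiple (assumed to be 4)."""
    frames = int(duration_s * FRAMES_PER_SECOND)
    return math.ceil(frames / multiple) * multiple


def samples_to_keep(duration_s: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of output samples that correspond to the original clip duration."""
    return int(round(duration_s * sample_rate))


print(padded_target_length(3.7), samples_to_keep(3.7))
```

With these helpers, `padded_target_length(3.7)` would be passed as `target_length` to `wav_to_fbank`, and the output could then be cut with `reconstructed_waveform[:samples_to_keep(3.7)]`. Whether the padded frames affect the reconstruction quality would need to be checked by listening.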