jerryuhoo/VISinger

Mismatch shape for f0 prediction and KL divergence

Closed this issue · 2 comments

Hi, I recently found your VISinger implementation (which is awesome), but I ran into a few problems.
The first is a shape mismatch in the f0 prediction and in the KL divergence calculation (logs_p & logs_q).

x: torch.Size([6, 192, 40])
x_frame: torch.Size([6, 192, 581]) 
ground truth f0: torch.Size([6, 558]) 
pred f0: torch.Size([6, 581])

I am also curious about how you chose the hyperparameters. For instance, I understand that 2595 & 700 come from the mel conversion, but how did you decide on 500 here?
featur_pit = 2595.0 * np.log10(1.0 + featur_pit / 700.0) / 500

The mel length mismatch (581 vs. 558) should be resolved during preprocessing. Make sure you run python prepare/data_vits_phn.py, not python prepare/data_vits.py, which comes from the original repo.
As for the hyperparameters, I simply emailed the original author and asked how he calculated the pitch. I think he divides by 500 to balance the pitch loss so that it won't be too large.
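
For anyone reading along, here is a minimal sketch of the pitch formula quoted above, written as a standalone function together with its inverse. The function names and the `scale` parameter are made up for illustration and are not taken from the repo; only the constants 2595, 700, and 500 come from the thread. It shows that dividing by 500 brings typical singing pitches down to values of roughly 1 to 2, which fits the guess that the division keeps the pitch loss small.

```python
import numpy as np


def hz_to_scaled_mel(f0_hz, scale=500.0):
    """Convert f0 in Hz to the mel scale, then divide by `scale`.

    Hypothetical helper: mirrors the one-line formula from the thread.
    """
    f0_hz = np.asarray(f0_hz, dtype=np.float64)
    return 2595.0 * np.log10(1.0 + f0_hz / 700.0) / scale


def scaled_mel_to_hz(pitch, scale=500.0):
    """Inverse of hz_to_scaled_mel (hypothetical helper)."""
    pitch = np.asarray(pitch, dtype=np.float64)
    return 700.0 * (10.0 ** (pitch * scale / 2595.0) - 1.0)


# Example f0 values in Hz; 0.0 stands for an unvoiced frame.
f0 = np.array([0.0, 110.0, 220.0, 440.0, 880.0])
scaled = hz_to_scaled_mel(f0)
recovered = scaled_mel_to_hz(scaled)

print(scaled)      # scaled mel values fed to the pitch loss
print(recovered)   # should match f0 up to floating-point error
```

Without the division, 440 Hz maps to about 550 mel, so an L1 or L2 loss on raw mel values would be orders of magnitude larger than on the scaled ones.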

Thank you so much for your reply! I resampled the wav files to 24 kHz explicitly, and it works well now!
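
A rough sketch of why the sample rate can matter for the frame counts, in case it helps others who hit the same mismatch. This is only one plausible reading of the fix: if the f0 track and the spectrogram are computed from audio at different sample rates but with the same hop length, they end up with different numbers of frames. The hop length of 256, the 5-second duration, and the 22050 Hz rate are illustrative assumptions, not values taken from the repo, and the simple `samples // hop` framing may differ from what the actual preprocessing does.

```python
def expected_frames(duration_s, sample_rate, hop_length):
    """Approximate frame count for a clip, assuming samples // hop framing.

    Hypothetical helper; real feature extractors may pad or center frames.
    """
    num_samples = int(round(duration_s * sample_rate))
    return num_samples // hop_length


HOP = 256          # assumed hop length, for illustration only
DURATION_S = 5.0   # assumed clip length in seconds

frames_24k = expected_frames(DURATION_S, 24000, HOP)
frames_22k = expected_frames(DURATION_S, 22050, HOP)

print(frames_24k, frames_22k)

# A cheap sanity check to run during preprocessing: fail early instead of
# hitting a shape mismatch inside the loss computation.
if frames_24k != frames_22k:
    print("frame counts differ; resample all audio to one rate first")
```

A check like this at preprocessing time catches the problem before training, where it would otherwise only show up as mismatched tensor shapes.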