OlaWod/FreeVC

s-o-p pronunciation, high-low tone distortion

lsw5835 opened this issue · 5 comments

Hi, thank you for the quick replies and kindness.

While testing through fine-tuning on various data, I found that distortion commonly occurs in the o, s, and p pronunciations, and that distortion also occurs when generating high-pitched sounds rather than target speaker data. Is there a way to solve these two problems?

Or, it would be very helpful if you could tell me a metric that can check for the presence or absence of distortion.

Thanks very much.

  1. I didn't notice the distortion in s-o-p pronunciation before; it's an interesting finding. Does it occur in words like 'Ok', 'appleS', 'helP', etc.? Maybe that's because these pronunciations are harder to model?
  2. I don't quite understand what the second problem is; what does 'rather than target speaker data' mean?

Maybe providing extra knowledge like pitch to the model could solve these problems?
So far the only distortion-related objective metrics I know are WER/CER/PER.
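To make the WER/CER idea concrete, here is a minimal character-error-rate sketch. It assumes you transcribe the converted audio with some ASR system (not shown) and compare against the reference text; the function names are mine, not from the repo:

```python
def edit_distance(ref, hyp):
    # classic dynamic-programming Levenshtein distance over characters
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def cer(reference, hypothesis):
    # character error rate: edit operations per reference character
    return edit_distance(reference, hypothesis) / max(len(reference), 1)
```

A rising CER on converted audio versus the source transcript would be one coarse signal that intelligibility-damaging distortion (like the s/p artifacts above) is present.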

I have noticed the same: when a model is fine-tuned to a specific speaker, the S sounds are really bad, so anything ending in S suffers, especially if it's a longer S. I have only noticed it with S sounds though; I haven't seen a problem with P or O.

It's a metallic sound, like a phase offset from the vocoder.

I think he meant the output is not that of the target speaker?

> 1. I didn't notice the distortion in s-o-p pronunciation before; it's an interesting finding. Does it occur in words like 'Ok', 'appleS', 'helP', etc.? Maybe that's because these pronunciations are harder to model?
> 2. I don't quite understand what the second problem is; what does 'rather than target speaker data' mean?
>
> Maybe providing extra knowledge like pitch to the model could solve these problems? So far the only distortion-related objective metrics I know are WER/CER/PER.

  1. Yes, that error occurs from time to time.
  2. That means that if the pitch or a pronunciation in the source wav is not present in the target data, distortion occurs.
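One rough way to quantify that pitch mismatch is to measure how much of the source F0 falls outside the target speaker's observed F0 range. The sketch below is illustrative (not part of FreeVC) and assumes per-frame F0 arrays from an extractor such as pyin or CREPE, with 0 marking unvoiced frames:

```python
import numpy as np

def out_of_range_ratio(source_f0, target_f0, margin=0.05):
    """Fraction of voiced source frames whose F0 falls outside the
    target speaker's observed F0 range (widened by a relative margin).
    Both arrays hold Hz per frame, with 0 marking unvoiced frames."""
    src = source_f0[source_f0 > 0]   # keep voiced frames only
    tgt = target_f0[target_f0 > 0]
    lo = tgt.min() * (1 - margin)
    hi = tgt.max() * (1 + margin)
    return float(np.mean((src < lo) | (src > hi)))
```

A high ratio would suggest the source asks for pitches the model never saw in the target speaker's fine-tuning data, which is where the distortion described above would be expected.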

While looking for other augmentation methods, I found that there is a stretch function in utils.py. Have you ever tested horizontal augmentation?

It'll be harder to convert if the voices of the source and target are very different.
I did not use horizontal augmentation because it does not change the speaker information of the source wav.
By the way, I'd like to share some files: ↓

original.mp4
change_duration.mp4
change_pitch.mp4
change_volume.mp4

I think that, by applying different vertical SR ratios within a single wav, the augmentation can be stronger and help the disentanglement more.

Thanks very much for answering.