auspicious3000/SpeechSplit

Tuning bottlenecks according to Appendix B.4

vishal16babu opened this issue · 2 comments

Although the tuning process mentioned is very intuitive, there seems to be no theoretical guarantee that the same bottleneck sizes will work for all speakers. I think it's a research problem in itself to be able to decide the bottlenecks directly from the speech (without going through the manual tuning process).

But practically speaking, it might be possible that one set of bottleneck sizes works well for most cases. Is that true of the sizes used in the repo? Has anyone tried the same sizes on a different dataset? Since training takes a long time, repeating the tuning process for every new speaker or dataset could make the approach very impractical to use.

@auspicious3000 any insights or help is very much appreciated

The bottleneck sizes provided in the paper are a good start. Training usually takes less than 24 hours. As a research project, our main purpose is to make sizable progress toward unsupervised speaking style transfer, provide insights, and hopefully inspire other researchers in this area.
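The tuning procedure discussed here (start from the paper's sizes and shrink or grow each bottleneck until disentanglement looks right) can be sketched as a simple sweep. This is a hypothetical illustration, not code from the repo: the hyperparameter names `dim_neck`, `dim_neck_2`, and `dim_neck_3` and the baseline values follow the repo's `hparams.py` convention but should be double-checked against your checkout, and the actual train/evaluate step per configuration is left out.

```python
# Hypothetical sketch of the Appendix B.4 tuning loop: enumerate
# bottleneck-size combinations around a baseline, then train and
# compare each one. Names and default values are assumptions taken
# from the repo's hparams convention; verify against hparams.py.
from itertools import product

# Assumed paper/repo defaults as the starting point.
baseline = {"dim_neck": 8, "dim_neck_2": 1, "dim_neck_3": 32}

# Candidate sizes to sweep around the baseline (illustrative choices).
grid = {
    "dim_neck": [4, 8, 16],   # content bottleneck
    "dim_neck_2": [1, 2],     # rhythm bottleneck
    "dim_neck_3": [16, 32],   # pitch bottleneck
}

def candidate_configs(grid):
    """Yield every combination of bottleneck sizes in the grid."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(candidate_configs(grid))
print(len(configs))  # 3 * 2 * 2 = 12 configurations to train and compare
```

Even a small grid like this multiplies quickly, which is exactly the practicality concern raised above: each configuration costs a full training run, so starting from the paper's sizes and only perturbing one bottleneck at a time keeps the sweep manageable.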

Thanks @auspicious3000 , I will give it a try.