janvainer/speedyspeech

Duplicated padding when computing the STFT transform.


I went through the spectrogram-extraction code in the code base and found some duplicated padding.

In the simple case of applying an STFT to a single audio signal, it is common to use centered reflect padding so that the mapping between sample indices and frame indices stays simple. (This is also librosa's default behavior.)
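For reference, here is a minimal sketch (with illustrative values n_fft=1024 and hop_length=256, not taken from this repo) showing that librosa's centered STFT is equivalent to manually reflect-padding by n_fft // 2 on each side and then calling stft with center=False:

```python
import numpy as np
import librosa

n_fft, hop = 1024, 256  # illustrative values, not the repo's config
y = np.random.randn(22050).astype(np.float32)

# librosa's center=True reflect-pads the signal by n_fft // 2 on each side
S_center = librosa.stft(y, n_fft=n_fft, hop_length=hop,
                        center=True, pad_mode="reflect")

# the same thing done manually, with the internal padding disabled
y_padded = np.pad(y, n_fft // 2, mode="reflect")
S_manual = librosa.stft(y_padded, n_fft=n_fft, hop_length=hop, center=False)

assert S_center.shape == S_manual.shape
assert np.allclose(S_center, S_manual, atol=1e-5)
```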

But for a batch of audios processed by a batched STFT implemented with convolution, there is no straightforward way to apply reflect padding per audio once they are already zero-padded to a common length, since we would have to find the end of each audio to correctly apply reflect padding there.

I noticed that in your code, each audio is first center reflect-padded, and then the audios are padded to the max length and batched. (

lens = [len(yy) for yy in y]
)
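To make that workflow concrete, here is a minimal sketch of the preprocessing step as I understand it (the helper name and the n_fft value are mine, not the repo's): each audio is reflect-padded individually first, then the padded audios are zero-padded to the max length and stacked:

```python
import numpy as np
import torch

n_fft = 1024  # illustrative value

def batch_audios(audios):
    # 1) center reflect-pad each audio individually, as librosa would
    padded = [np.pad(a, n_fft // 2, mode="reflect") for a in audios]
    # 2) record the (already padded) lengths, then zero-pad to the max length
    lens = [len(p) for p in padded]
    max_len = max(lens)
    batch = np.stack([np.pad(p, (0, max_len - len(p))) for p in padded])
    return torch.from_numpy(batch).float(), lens

audios = [np.random.randn(n) for n in (8000, 12000, 10000)]
batch, lens = batch_audios(audios)  # batch.shape == (3, max_len)
```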

But when this batched audio is passed to the STFT module, it is padded again. (

# similar to librosa, reflect-pad the input
)

I think that when the audios are manually reflect-padded, they should not be padded again inside the batched STFT. Librosa's stft has a center=False option to disable its internal padding.
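As an illustration of the fix, here is a sketch using torch.stft rather than the repo's convolution-based STFT module (so it demonstrates the principle, not an actual patch): once the audios have been reflect-padded manually, the transform should run with its internal padding disabled, analogous to librosa's center=False:

```python
import torch

n_fft, hop = 1024, 256  # illustrative values
window = torch.hann_window(n_fft)

# pretend this batch already contains manually reflect-padded audios
batch = torch.randn(3, 22050 + n_fft)

# center=False disables the second, internal reflect padding
spec = torch.stft(batch, n_fft=n_fft, hop_length=hop, window=window,
                  center=False, return_complex=True)
mag = spec.abs()  # shape: (batch, n_fft // 2 + 1, frames)
```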

Hm, it seems like you are right. This may be a bug. Did you compare the results with and without it?

This may not affect the quality much since the padding is very short (about 23 ms).
I haven't compared the results.
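For a rough back-of-the-envelope check (assuming sr = 22050 and n_fft = 1024, which may not match the actual config), the duplicated padding adds n_fft // 2 samples per side, i.e. roughly 23 ms and a few extra frames per utterance:

```python
sr, n_fft, hop = 22050, 1024, 256  # assumed values, check against the config
n_samples = 5 * sr                 # a 5-second utterance

def n_frames(length):
    return 1 + (length - n_fft) // hop

single_pad = n_samples + 2 * (n_fft // 2)  # padded once (correct)
double_pad = n_samples + 4 * (n_fft // 2)  # padded twice (current behaviour)

print((n_fft // 2) / sr * 1000)                     # ~23.2 ms of extra padding per side
print(n_frames(double_pad) - n_frames(single_pad))  # 4 extra frames
```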

This could have some impact on the audio quality at the edges of the utterance. I noticed some edge-related problems before. Thanks for pointing this out!