geohot/tinyvoice

encoder+linear is over-compressing time dimension

lunixbochs opened this issue · 1 comment

The layout of the conv2d and linear layers in your encoder ends up cramping the output tokens along the time dimension:

forward        torch.Size([32, 1600, 80])
self.encode    torch.Size([32, 99, 1216])
self.linear    torch.Size([32, 99, 144])
self.conformer torch.Size([32, 99, 144])

I would expect a shape closer to [32, 400, 144] leading into the Conformer blocks, laid out as (batch, time, freq).

Per the Conformer paper, your encode+linear is taking the place of their "Convolution Subsampling" step, which converts from a 10ms frame size to 40ms (a 4x reduction in time).
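To put numbers on the shapes above, here is a rough sketch of the frame-size arithmetic. It assumes the 1600 input frames use a 10ms hop (the issue only implies this via the Conformer comparison), and `frame_ms` is a hypothetical helper, not something from the repo.

```python
def frame_ms(n_in: int, n_out: int, hop_ms: float = 10.0) -> float:
    """Effective duration of one output frame, in milliseconds.

    Assumes every input frame covers hop_ms of audio (10ms is an assumption).
    """
    return n_in * hop_ms / n_out


# Time lengths taken from the shapes quoted in the issue.
n_in = 1600
for label, n_out in [("reported", 99), ("expected", 400)]:
    factor = n_in / n_out
    print(f"{label}: {n_out} frames, {factor:.1f}x subsampling, "
          f"{frame_ms(n_in, n_out):.1f}ms per frame")
```

Under that assumption the reported 99 frames work out to roughly 16x subsampling (about 160ms per output frame), versus the 4x and 40ms that the Conformer setup targets.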

The wav2letter RASR Conformer model uses a single nn.Conv1d(nfeat, H*2, kernel_size=7, stride=3) followed by an nn.GLU(1), which yields ~30ms frames.
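The time lengths these layouts produce can be checked with the standard output-length formula for an unpadded strided convolution. This is a minimal sketch: `conv_out_len` is a hypothetical helper, and the kernel size of 3 for the stride-2 stacks is an assumption, since the issue does not give the actual layer configuration in the repo.

```python
def conv_out_len(length: int, kernel_size: int, stride: int,
                 padding: int = 0, dilation: int = 1) -> int:
    """Output length along one axis of a strided convolution
    (the same formula PyTorch documents for Conv1d/Conv2d)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def stacked(length: int, n_layers: int, kernel_size: int = 3, stride: int = 2) -> int:
    """Apply the same conv geometry n_layers times (kernel_size=3 is an assumption)."""
    for _ in range(n_layers):
        length = conv_out_len(length, kernel_size, stride)
    return length


T = 1600  # input frames, from the shapes in the issue

# Single Conv1d(kernel_size=7, stride=3); GLU(1) halves channels, not time.
print("k=7, s=3:        ", conv_out_len(T, kernel_size=7, stride=3))
# Two stride-2 layers give roughly 4x subsampling.
print("2 x (k=3, s=2):  ", stacked(T, 2))
# Four stride-2 layers give roughly 16x subsampling.
print("4 x (k=3, s=2):  ", stacked(T, 4))
```

With 10ms input frames, 532 output frames corresponds to about 30ms each and 399 to about 40ms. Four unpadded stride-2 layers happen to give 99, the same time length as the reported shape, though that is only one configuration that would do so.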

Yeah, I know. I have an expander at the end behind a flag, I just need to retrain. The Google 40ms is for syllables, not letters, and that's what I missed.