encoder+linear is over-compressing time dimension
lunixbochs opened this issue · 1 comment
lunixbochs commented
The layout of the conv2d and linear layers in your encoder ends up over-compressing the time dimension of the output tokens:
forward torch.Size([32, 1600, 80])
self.encode torch.Size([32, 99, 1216])
self.linear torch.Size([32, 99, 144])
self.conformer torch.Size([32, 99, 144])
I would expect a shape closer to [32, 400, 144] leading into the Conformer blocks, as (batch, time, freq).
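A quick back-of-envelope check makes the gap concrete. Assuming the 1600 input frames use a 10ms hop (as in the Conformer paper's setup), the observed 99 output steps imply each step covers far more audio than the expected ~40ms:

```python
# Effective duration covered by each output step, assuming a 10 ms input hop.
def frame_ms(n_in, n_out, hop_ms=10.0):
    return n_in * hop_ms / n_out

print(frame_ms(1600, 99))   # observed encoder: ~160 ms per step
print(frame_ms(1600, 400))  # expected ~4x subsampling: 40.0 ms per step
```

So the current encoder is downsampling time by roughly 16x rather than the ~4x the Conformer design calls for.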
From the Conformer paper, your encode+linear takes the place of their "Convolution Subsampling" step, which converts from a 10ms frame size to 40ms.
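For reference, the paper's subsampling is two stacked stride-2 convolutions, which works out to the expected ~400 time steps. A minimal length calculation (kernel size 3 is an assumption here; the exact kernel shifts the count only slightly):

```python
# Output length of a valid (no-padding) convolution along time.
def conv_out_len(n, kernel, stride):
    return (n - kernel) // stride + 1

t = 1600
for _ in range(2):  # two stacked stride-2 convs, as in Conformer subsampling
    t = conv_out_len(t, kernel=3, stride=2)
print(t)  # 399 steps, i.e. ~4x reduction: 10 ms -> 40 ms frames
```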
The wav2letter RASR Conformer model uses a single nn.Conv1d(nfeat, H*2, kernel_size=7, stride=3) followed by a nn.GLU(1), which yields ~30ms frames.
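A sketch of that front end in PyTorch; nfeat=80 and H=144 are assumptions chosen to match the shapes printed in this issue, not values from the RASR model:

```python
import torch
from torch import nn

nfeat, H = 80, 144  # assumed to match this issue's shapes; RASR's H may differ
frontend = nn.Sequential(
    nn.Conv1d(nfeat, H * 2, kernel_size=7, stride=3),  # ~3x time reduction
    nn.GLU(dim=1),                                     # gates 2H channels down to H
)

x = torch.randn(32, nfeat, 1600)  # (batch, freq, time), 10 ms frames
y = frontend(x)
print(y.shape)  # torch.Size([32, 144, 532]) -> ~30 ms per output step
```

The stride-3 conv takes 1600 frames to (1600 - 7) // 3 + 1 = 532 steps, i.e. roughly a 3x reduction, which is where the ~30ms figure comes from.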
geohot commented
Yeah, I know. I have an expander at the end behind a flag; I just need to retrain. Google's 40ms is for syllables, not letters, and that's what I missed.