Question about batch size vs num frames
cyrilzakka opened this issue · 4 comments
Hello again,
I have one last question that I'm still unclear about. In this implementation, is the shape of the input being fed into the network (B x C x H x W), with B being the number of frames? Or is it actually (B x F x C x H x W), with F being the number of frames?
Hi @cyrilzakka
The initial input shape is BF x C x H x W, with batch_size and num_frames flattened together.
Before the transformer encoder phase, all computations are independent across frames.
The idea behind flattening batch_size and num_frames is that PyTorch's Conv2d operations take input of shape (batch, channel, height, width).
To keep the per-frame computations separate and avoid 'for loops', we flatten batch_size and num_frames into a single dimension.
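A minimal sketch of this flatten/unflatten pattern (the shapes and the Conv2d layer here are illustrative, not the repo's actual code):

```python
import torch

B, F, C, H, W = 2, 8, 3, 32, 32          # batch, frames, channels, height, width
x = torch.randn(B, F, C, H, W)

# Merge batch and frame dims so Conv2d treats every frame as an independent sample
x_flat = x.reshape(B * F, C, H, W)        # (BF, C, H, W)

conv = torch.nn.Conv2d(C, 16, kernel_size=3, padding=1)
feat = conv(x_flat)                       # (BF, 16, H, W)

# Restore the frame dimension before the temporal / transformer stage
feat = feat.reshape(B, F, *feat.shape[1:])  # (B, F, 16, H, W)
print(feat.shape)
```

The same reshape-forward-reshape trick applies to any per-frame 2D operation; no loop over frames is needed.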
Thanks for the answer! Would you mind pointing me at the line of code responsible for converting B x F x C x H x W to BF x C x H x W and then back prior to the transformer?
All I found are:
Line 192 in cfa38e3
and:
Thank you for your time!