sukjunhwang/IFC

Question about batch size vs num frames

cyrilzakka opened this issue · 4 comments

Hello again,

I have one last question that I'm still unclear about. In this implementation, is the input fed into the network of size (B x C x H x W), with B being the number of frames? Or is it actually (B x F x C x H x W), with F being the number of frames?

Hi @cyrilzakka

The initial input shape is BF x C x H x W, i.e., batch_size and num_frames flattened together.
Before the transformer encoder phase, all computations are independent across frames.
The idea behind flattening batch_size and num_frames is that PyTorch's conv2d operations take inputs of shape (batch, channel, height, width).
Flattening batch_size and num_frames keeps the per-frame computations separate while avoiding Python for-loops over frames.
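
For illustration, here is a minimal sketch of this flatten-then-restore pattern (dimension sizes and variable names are hypothetical, not the repo's exact code):

import torch

B, F, C, H, W = 2, 5, 3, 224, 224           # hypothetical: 2 videos, 5 frames each
clips = torch.randn(B, F, C, H, W)

# Flatten batch and frame dims so conv2d treats each frame as an independent sample.
frames = clips.flatten(0, 1)                 # (B*F, C, H, W)

conv = torch.nn.Conv2d(C, 64, kernel_size=3, padding=1)
feats = conv(frames)                         # (B*F, 64, H, W), no Python loop over frames

# Restore the frame dimension before any per-clip (transformer) processing.
feats = feats.view(B, F, *feats.shape[1:])   # (B, F, 64, H, W)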

Thanks for the answer! Would you mind pointing me to the line of code responsible for converting B x F x C x H x W to BF x C x H x W and then back prior to the transformer?

All I found are:

video = self.preprocess_image(batched_inputs)

and:

bs = src.shape[0] // self.num_frames if is_train else 1

IFC/projects/IFC/ifc/ifc.py

Lines 396 to 405 in cfa38e3

def preprocess_image(self, batched_inputs):
    """
    Normalize, pad and batch the input images.
    """
    images = []
    for video in batched_inputs:
        for frame in video["image"]:
            images.append(self.normalizer(frame.to(self.device)))
    images = ImageList.from_tensors(images)
    return images

Given the code above, all frames of all batched videos go into the "images" list.
So it is effectively the same as flattening B and F.
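
For intuition, a toy example (hypothetical shapes) showing that the nested loop matches flattening B and F:

import torch

videos = [torch.randn(5, 3, 224, 224) for _ in range(2)]    # B=2 videos, F=5 frames each

# Nested-loop version, as in preprocess_image: one flat list of B*F frames.
frames = [frame for video in videos for frame in video]
flat = torch.stack(frames)                                  # (10, 3, 224, 224)

# Identical to stacking the videos and flattening B and F together.
assert torch.equal(flat, torch.stack(videos).flatten(0, 1))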

video = self.preprocess_image(batched_inputs)

Therefore, the shape of the "video" tensor would be BF x 3 x H x W, with 3 denoting the BGR channels.
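
As for the `bs = src.shape[0] // self.num_frames` line you found: it recovers B from the flattened BF axis so the features can be reshaped back before the transformer. A rough sketch with illustrative shapes (not the exact repo code):

import torch

num_frames = 5
src = torch.randn(2 * num_frames, 256, 28, 28)     # backbone output: (B*F, C', H', W')

bs = src.shape[0] // num_frames                    # recover B from the flattened BF axis
src = src.view(bs, num_frames, *src.shape[1:])     # (B, F, C', H', W') for the transformer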

Thank you for your time!