sukjunhwang/IFC

Question about batch size vs num frames

cyrilzakka opened this issue · 4 comments

Hello again,

I have one last question that I'm still unclear about. In this implementation, is the input fed into the network of size (B x C x H x W), with B being the number of frames? Or is it actually (B x F x C x H x W), with F being the number of frames?

Hi @cyrilzakka

The initial input shape is BF x C x H x W, i.e., batch_size and num_frames flattened together.
Before the transformer encoder phase, all computations are independent across frames.
The idea behind flattening batch_size and num_frames is that PyTorch's conv2d operations take inputs of shape (batch, channel, height, width).
Flattening batch_size and num_frames keeps the per-frame computations separate while avoiding Python for-loops over frames.
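
For illustration, here is a minimal sketch of this flatten-then-restore pattern (dimension sizes and variable names are hypothetical, not the repo's exact code):

import torch

B, F, C, H, W = 2, 5, 3, 224, 224           # hypothetical: 2 videos, 5 frames each
clips = torch.randn(B, F, C, H, W)

# Flatten batch and frame dims so conv2d treats each frame as an independent sample.
frames = clips.flatten(0, 1)                 # (B*F, C, H, W)

conv = torch.nn.Conv2d(C, 64, kernel_size=3, padding=1)
feats = conv(frames)                         # (B*F, 64, H, W), no Python loop over frames

# Restore the frame dimension before any per-clip (transformer) processing.
feats = feats.view(B, F, *feats.shape[1:])   # (B, F, 64, H, W)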

Thanks for the answer! Would you mind pointing me to the line of code responsible for converting B x F x C x H x W to BF x C x H x W and then back prior to the transformer?

All I found are:

video = self.preprocess_image(batched_inputs)

and:

bs = src.shape[0] // self.num_frames if is_train else 1

IFC/projects/IFC/ifc/ifc.py

Lines 396 to 405 in cfa38e3

def preprocess_image(self, batched_inputs):
    """
    Normalize, pad and batch the input images.
    """
    images = []
    for video in batched_inputs:
        for frame in video["image"]:
            images.append(self.normalizer(frame.to(self.device)))
    images = ImageList.from_tensors(images)
    return images

Given the code above, all frames of all batched videos go into the "images" list.
So it is effectively the same as flattening B and F.
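
For intuition, a toy example (hypothetical shapes) showing that the nested loop matches flattening B and F:

import torch

videos = [torch.randn(5, 3, 224, 224) for _ in range(2)]    # B=2 videos, F=5 frames each

# Nested-loop version, as in preprocess_image: one flat list of B*F frames.
frames = [frame for video in videos for frame in video]
flat = torch.stack(frames)                                  # (10, 3, 224, 224)

# Identical to stacking the videos and flattening B and F together.
assert torch.equal(flat, torch.stack(videos).flatten(0, 1))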

video = self.preprocess_image(batched_inputs)

Therefore, the shape of the "video" tensor would be BF x 3 x H x W, with 3 denoting the BGR channels.
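
As for the `bs = src.shape[0] // self.num_frames` line you found: it recovers B from the flattened BF axis so the features can be reshaped back before the transformer. A rough sketch with illustrative shapes (not the exact repo code):

import torch

num_frames = 5
src = torch.randn(2 * num_frames, 256, 28, 28)     # backbone output: (B*F, C', H', W')

bs = src.shape[0] // num_frames                    # recover B from the flattened BF axis
src = src.view(bs, num_frames, *src.shape[1:])     # (B, F, C', H', W') for the transformer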

Thank you for your time!