Sense-X/UniFormer

Question regarding ImageNet pretraining

MLDeS opened this issue · 4 comments

MLDeS commented

Thanks for the nice work! I have a question regarding the model training reported in the paper. It says

With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600,

My question is: the models are video models that take n frames as input, whereas ImageNet is an image dataset with single-frame inputs. So which parts use the ImageNet-pretrained weights?

All the parts use ImageNet pretraining. For convolution, if the temporal dimension is larger than 1, we copy the convolution weights along time and average them. For self-attention, we copy the same weights. Please check the code:

def inflate_weight(self, weight_2d, time_dim, center=False):
    if center:
        # Place the 2D kernel in the central temporal slice, zeros elsewhere.
        weight_3d = torch.zeros(*weight_2d.shape)
        weight_3d = weight_3d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
        middle_idx = time_dim // 2
        weight_3d[:, :, middle_idx, :, :] = weight_2d
    else:
        # Copy the 2D kernel to every temporal slice and average over time.
        weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
        weight_3d = weight_3d / time_dim
    return weight_3d
def get_pretrained_model(self, cfg):
    if cfg.UNIFORMER.PRETRAIN_NAME:
        checkpoint = torch.load(model_path[cfg.UNIFORMER.PRETRAIN_NAME], map_location='cpu')
        if 'model' in checkpoint:
            checkpoint = checkpoint['model']
        elif 'model_state' in checkpoint:
            checkpoint = checkpoint['model_state']

        state_dict_3d = self.state_dict()
        for k in checkpoint.keys():
            if checkpoint[k].shape != state_dict_3d[k].shape:
                # Weights with <= 2 dims (e.g. linear layers) are not inflated;
                # only convolution kernels gain a temporal dimension.
                if len(state_dict_3d[k].shape) <= 2:
                    logger.info(f'Ignore: {k}')
                    continue
                logger.info(f'Inflate: {k}, {checkpoint[k].shape} => {state_dict_3d[k].shape}')
                time_dim = state_dict_3d[k].shape[2]
                checkpoint[k] = self.inflate_weight(checkpoint[k], time_dim)

        # Drop the classification head if the number of classes differs.
        if self.num_classes != checkpoint['head.weight'].shape[0]:
            del checkpoint['head.weight']
            del checkpoint['head.bias']
        return checkpoint
    else:
        return None
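
To make the inflation concrete, here is a small standalone sketch (my own example, not code from the repo) that mirrors the center=False branch: the 2D kernel is repeated time_dim times and divided by time_dim, so on a clip whose frames are identical the inflated 3D convolution reproduces the 2D convolution on interior frames.

import torch
import torch.nn as nn

# Hypothetical example layers, chosen only for illustration.
time_dim = 3
conv2d = nn.Conv2d(3, 64, kernel_size=7, padding=3, bias=False)
conv3d = nn.Conv3d(3, 64, kernel_size=(time_dim, 7, 7),
                   padding=(time_dim // 2, 3, 3), bias=False)

with torch.no_grad():
    w2d = conv2d.weight                                              # (64, 3, 7, 7)
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim   # (64, 3, 3, 7, 7)
    conv3d.weight.copy_(w3d)

# On a clip whose frames are identical, the averaged inflation matches the 2D
# response on interior frames (boundary frames differ because of zero padding).
x = torch.randn(1, 3, 1, 32, 32).repeat(1, 1, 8, 1, 1)   # (B, C, T, H, W)
y3d = conv3d(x)             # (1, 64, 8, 32, 32)
y2d = conv2d(x[:, :, 0])    # (1, 64, 32, 32)
print(torch.allclose(y3d[:, :, 4], y2d, atol=1e-4))       # expected: True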

MLDeS commented

Thanks a lot for the quick response; the pointer to the code helps a lot! Just two follow-up questions.

  1. I understand the ImageNet pretraining is done on the image-based UniFormer architecture and transferred to the video UniFormer architecture by inflating the weights as above, right?
  2. a) Is there a table comparing training with and without ImageNet pretraining? b) I see that Table 17 in the paper presents results showing that inflating the weights to 3D performs better than 2D. What is the basis of this comparison? If it is a video model, the 3D inflation was always done, right? Whether centered on the middle slice or averaged equally across the time dimension. So what does the 2D comparison mean here?

Thanks a lot again for your time to answer the questions!

For convolution inflation, I suggest you read the I3D paper.

As for your other questions:

  1. Yes.
  2. a) Without ImageNet pretraining, convergence is much slower, so initializing from image pretraining is a common strategy in video training. b) 2D means we do not inflate the convolutions: the temporal dimension is merged with the batch dimension, so each frame is processed independently by the 2D convolutions. For attention, however, we still use spatiotemporal attention. See the sketch below.
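
For 2b), here is a rough sketch (my own illustration, not the repo's code) of what the "2D" baseline means for the convolutions: time is folded into the batch dimension and an ordinary 2D convolution runs per frame, whereas the inflated "3D" variant uses a Conv3d whose weights come from inflate_weight above and therefore mixes neighboring frames.

import torch
import torch.nn as nn

# "2D" baseline: keep the convolution 2D and merge time into the batch dimension,
# so the convolution itself does no temporal mixing.
conv2d = nn.Conv2d(64, 64, kernel_size=3, padding=1)

x = torch.randn(2, 64, 8, 56, 56)                          # (B, C, T, H, W)
B, C, T, H, W = x.shape
x2d = x.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)     # (B*T, C, H, W)
y2d = conv2d(x2d)
y = y2d.reshape(B, T, -1, H, W).permute(0, 2, 1, 3, 4)     # back to (B, C', T, H, W)

# "3D" variant: a Conv3d initialized by inflating the same 2D weights, so each
# output frame also aggregates information from neighboring frames.
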
MLDeS commented

Thanks a lot for the answers!