Question regarding ImageNet pretraining
Thanks for the nice work! I have a question regarding the model training reported in the paper. It says:
> With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600,
The models are video models that take n frames as input, whereas ImageNet is an image dataset with single-image inputs. So my question is: which parts use ImageNet-pretrained weights?
All the parts use ImageNet pretraining. For convolutions, if the temporal kernel dimension is larger than 1, we copy the convolution weights along the temporal dimension and average them. For self-attention, we copy the same weights. Please check the code:
`UniFormer/video_classification/slowfast/models/uniformer.py`, lines 387 to 421 at commit f92e423
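For illustration only, here is a minimal sketch of the inflation idea described above. It is not the repository's actual `uniformer.py` code, and the helper name `inflate_conv2d_to_3d` is hypothetical; it just shows the copy-and-average recipe for convolutions and the direct copy for attention/linear weights:

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, temporal_kernel: int) -> nn.Conv3d:
    """Hypothetical helper: turn an image-pretrained 2D conv into a 3D conv.

    The 2D kernel (out, in, kh, kw) is repeated `temporal_kernel` times along a
    new temporal axis and divided by `temporal_kernel`, so a temporally constant
    input produces the same output as the original 2D conv (the I3D recipe).
    """
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(temporal_kernel, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(temporal_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        w2d = conv2d.weight  # (out, in, kh, kw)
        w3d = w2d.unsqueeze(2).repeat(1, 1, temporal_kernel, 1, 1) / temporal_kernel
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Self-attention (linear) layers have no temporal kernel, so their pretrained
# weights can simply be copied one-to-one from the image checkpoint.
```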
Thanks a lot for the quick response; the pointer to the code helps a lot! Just two follow-up questions.
- Do I understand correctly that the ImageNet pretraining is done on the image-based UniFormer architecture and then transferred to the video UniFormer architecture by inflating the weights as above?
- a) Is there a table comparing ImageNet pretraining vs. no pretraining? b) I see that Table 17 in the paper presents results showing that inflating the weights to 3D performs better than 2D. What is the basis of this comparison? If it is a video model, 3D inflation was always done, right? Whether centered around the middle slice or averaged equally across the time dimension. So what does the 2D comparison refer to here?
Thanks a lot again for taking the time to answer the questions!
For convolution inflation, I suggest you read the I3D paper.
As for your other questions:
- Yes.
- a) Without ImageNet pretraining, convergence is much slower; using ImageNet pretraining is a common strategy in video training. b) "2D" means we do not inflate the convolutions and instead merge the temporal dimension into the batch dimension. For attention, however, we still use spatiotemporal attention.
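As a rough illustration of the "2D" vs. "3D" convolution variants described above (assumed tensor shapes, not the repository's code):

```python
import torch
import torch.nn as nn

# Assumed input: a batch of video clips of shape (B, C, T, H, W).
x = torch.randn(2, 64, 8, 56, 56)
B, C, T, H, W = x.shape

# "2D" variant: fold time into the batch dimension and apply a plain 2D conv,
# so the pretrained 2D kernel is used as-is with no temporal inflation.
conv2d = nn.Conv2d(C, C, kernel_size=3, padding=1)
y2d = conv2d(x.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W))
y2d = y2d.reshape(B, T, C, H, W).permute(0, 2, 1, 3, 4)  # back to (B, C, T, H, W)

# "3D" variant: use an inflated 3D conv so the kernel also mixes information
# across neighbouring frames.
conv3d = nn.Conv3d(C, C, kernel_size=(3, 3, 3), padding=(1, 1, 1))
y3d = conv3d(x)

print(y2d.shape, y3d.shape)  # both (2, 64, 8, 56, 56)
```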
Thanks a lot for the answers!