pretrained model

Question

Opened this issue a year ago · 7 comments

Hello, I really appreciate your work. May I ask where I can download the pretrained ViT model for ImageNet?

Answer 1 · 2023-04-13T17:21:41.000Z

Hi, you can download the ImageNet-pretrained ViT from timm.

Answer 2 · 2023-04-15T07:41:27.000Z

Got it, thank you! What does "Views = #frames × #temporal × #spatial" mean? Does it correspond to "clip_len, frame_interval, num_clips" in training?

Answer 3 · 2023-04-18T15:53:50.000Z

We usually do multi-view testing, so that is 'number of frames in one clip' x 'number of clips sampled by temporal crop' x 'number of clips sampled by spatial crop'.
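To make the arithmetic concrete, here is a minimal sketch of how such a "frames x temporal x spatial" entry can be read. The helper names `parse_views` and `count_views` are hypothetical and not part of the repository, and the example strings are illustrative values rather than figures quoted from the paper.

```python
def parse_views(spec):
    """Split a string such as '8x3x1' into (frames, temporal, spatial)."""
    # Accept both the ASCII 'x' and the multiplication sign, with or without spaces.
    parts = spec.lower().replace("×", "x").split("x")
    frames, temporal, spatial = (int(p) for p in parts)
    return frames, temporal, spatial


def count_views(spec):
    """Return (number of views, total frames processed per video)."""
    frames, temporal, spatial = parse_views(spec)
    # Each (temporal crop, spatial crop) pair is one view, i.e. one forward pass.
    num_views = temporal * spatial
    return num_views, frames * num_views


if __name__ == "__main__":
    for spec in ("8x3x1", "32 x 1 x 3"):
        views, total_frames = count_views(spec)
        print(spec, "->", views, "views,", total_frames, "frames in total")
```

Under this reading, "8x3x1" means three views of 8 frames each (24 frames in total), and "32 x 1 x 3" means three views of 32 frames each (96 frames in total).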

Answer 4 · 2023-04-19T12:23:26.000Z

OK, thank you. Why is the running environment on the server affected after I modify https://github.com/taoyang1122/adapt-image-models/blob/main/mmaction/datasets/base.py? What if I want to modify base.py without affecting the running environment?

Answer 5 · 2023-07-17T00:05:56.000Z

Hi, I am not sure what you mean by the environment being affected. Could you please explain in more detail?

Answer 6 · 2023-09-18T08:36:17.000Z

We usually do multi-view testing, so that is 'number of frames in one clip' x 'number of clips sampled by temporal crop' x 'number of clips sampled by spatial crop'.

Could you please tell me what these specific entries mean? #frames is the number of frames in a single clip, but how should I understand 'number of clips sampled by temporal crop' and 'number of clips sampled by spatial crop', and exactly how are they obtained? Additionally, in the paper's tables some entries are frames x 1 x 3 and others are frames x 3 x 1. Why is there such a difference?

Answer 7 · 2023-09-19T02:19:44.000Z

How they are obtained can differ between methods and datasets. For example, three temporal crops can be obtained by sampling clips from the first, middle, and last parts of the video. Three spatial crops could be obtained by cropping the upper-left corner, the center, and the lower-right corner. The final prediction is the ensemble of the predictions from the different crops (i.e., views). So the numbers mean frames x num_temporal_crops x num_spatial_crops during testing.
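The description above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code used in the repository: all function names are hypothetical, temporal crops are spaced evenly from the start to the end of the video, spatial crops are placed along the diagonal from upper-left to lower-right, and the ensemble is taken to be a simple average of per-view class scores.

```python
from itertools import product


def temporal_clip_starts(num_video_frames, clip_span, num_temporal_crops):
    """Start indices of clips spread evenly from the first to the last part."""
    max_start = max(num_video_frames - clip_span, 0)
    if num_temporal_crops == 1:
        return [max_start // 2]  # a single clip is taken from the middle
    return [(i * max_start) // (num_temporal_crops - 1)
            for i in range(num_temporal_crops)]


def spatial_crop_boxes(height, width, crop_size, num_spatial_crops):
    """Boxes (top, left, bottom, right) from upper-left to lower-right.

    Assumes crop_size is no larger than height or width.
    """
    max_top = max(height - crop_size, 0)
    max_left = max(width - crop_size, 0)
    if num_spatial_crops == 1:
        steps = [(1, 2)]  # a single crop is taken from the center
    else:
        steps = [(i, num_spatial_crops - 1) for i in range(num_spatial_crops)]
    boxes = []
    for num, den in steps:
        top = (num * max_top) // den
        left = (num * max_left) // den
        boxes.append((top, left, top + crop_size, left + crop_size))
    return boxes


def enumerate_views(starts, boxes):
    """Every (temporal start, spatial box) pair is one view."""
    return list(product(starts, boxes))


def ensemble_scores(view_scores):
    """Average the per-class scores over all views (assumed ensemble rule)."""
    num_views = len(view_scores)
    return [sum(scores) / num_views for scores in zip(*view_scores)]


if __name__ == "__main__":
    starts = temporal_clip_starts(100, 16, 3)
    boxes = spatial_crop_boxes(256, 340, 224, 1)
    print(enumerate_views(starts, boxes))
```

With these assumptions, a 100-frame video and a clip spanning 16 frames give three temporal crops starting at frames 0, 42, and 84. Combined with one spatial crop, that yields three views, matching a "frames x 3 x 1" entry; one temporal crop with three spatial crops would give "frames x 1 x 3".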