taoyang1122/adapt-image-models

I would like to know what testing protocol the 88.9% on Diving48 is based on.

Changwei-Ouyang opened this issue · 0 comments

Regarding the reported 88.9% accuracy of ViT-B on the Diving48 dataset in the paper, I would like to know the testing protocol on which this result is based.
val_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=16,
        num_clips=1,
        frame_uniform=True,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
With the validation pipeline above, testing vit_b_clip_32frame_diving48.pth gives 88.88%.
test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=16,
        num_clips=1,
        frame_uniform=True,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
    dict(type='Flip', flip_ratio=0),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs'])
]
With the test pipeline above, the result is lower than 88.9%. Moreover, the ThreeCrop operation does not match the 32×1×1 setting stated in the paper. Could you clarify which testing protocol produced the reported 88.9%?
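
To make the mismatch concrete, here is a minimal sketch that reads the view count off each pipeline. It assumes the paper's 32×1×1 notation means frames × temporal clips × spatial crops. The helper `count_views` and the abbreviated lists `val_steps` and `test_steps` are hypothetical names used only for illustration; they are not part of this repository or of mmaction2.

```python
# Hypothetical helper: infer (frames, temporal clips, spatial crops)
# from an mmaction2-style pipeline config list.
CROPS_PER_OP = {'CenterCrop': 1, 'ThreeCrop': 3, 'TenCrop': 10}


def count_views(pipeline):
    frames, clips, crops = None, 1, 1
    for step in pipeline:
        if step['type'] == 'SampleFrames':
            frames = step['clip_len']
            clips = step.get('num_clips', 1)
        elif step['type'] in CROPS_PER_OP:
            crops = CROPS_PER_OP[step['type']]
    return frames, clips, crops


# Only the steps that affect the view count are reproduced here.
val_steps = [
    dict(type='SampleFrames', clip_len=32, frame_interval=16, num_clips=1),
    dict(type='CenterCrop', crop_size=224),
]
test_steps = [
    dict(type='SampleFrames', clip_len=32, frame_interval=16, num_clips=1),
    dict(type='ThreeCrop', crop_size=224),
]

print('val :', count_views(val_steps))
print('test:', count_views(test_steps))
```

Under that assumption, the validation pipeline corresponds to 32×1×1 and the test pipeline to 32×1×3, which is why I suspect the reported number comes from the validation-style settings.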