taoyang1122/adapt-image-models

About input frames and sampling interval

Closed this issue · 1 comment

Thank you for your excellent work! I would like to ask about clip_len and frame_interval for Kinetics. Appendix A.1 says: "We evaluate the model on 8, 16, 32 frames and the sampling interval is 16, 8, 4, respectively." Does this mean that for Kinetics-400/700 the data pipelines (train, val, test) should all use the same settings? For example, in configs/recognition/vit/vit_imagenet_k400.py the data pipeline does match the paper.

That is, clip_len=8 and frame_interval=16 for the train/val/test pipelines, which matches the paper:

train_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=8, frame_interval=16, num_clips=1),
    ...
]

val_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=16,
        num_clips=1,
        test_mode=True),
    ...
]

test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=16,
        num_clips=3,
        test_mode=True),
    ...
]
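For context, the paper's three settings all cover the same temporal span: clip_len × frame_interval = 128 frames. A quick sketch (my own arithmetic, not code from this repo) that derives the interval the paper implies for each clip length:

```python
# The paper pairs (clip_len, frame_interval) as (8, 16), (16, 8), (32, 4):
# in every case a clip spans clip_len * frame_interval = 128 frames.
SPAN = 8 * 16  # 128 frames, taken from the paper's 8-frame setting

def paper_interval(clip_len):
    """Frame interval implied by the paper for a given clip length."""
    assert SPAN % clip_len == 0
    return SPAN // clip_len

for clip_len in (8, 16, 32):
    print(clip_len, paper_interval(clip_len))  # 8 -> 16, 16 -> 8, 32 -> 4
```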

But the configs for the CLIP-pretrained models are confusing:

  1. vitclip_base_k400: clip_len=32, frame_interval=16 for the train pipeline, but clip_len=32, frame_interval=8 for the val/test pipelines. However, if clip_len=32, shouldn't frame_interval be 4?

train_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=32, frame_interval=16, num_clips=1),
    ...
]

val_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=8,
        num_clips=1,
        test_mode=True),
    ...
]

test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=8,
        num_clips=3,
        test_mode=True),
    ...
]
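To see why frame_interval=16 at clip_len=32 looks off: a Kinetics clip is roughly 10 s (about 300 frames at 30 fps), while 32 frames at interval 16 span 32 × 16 = 512 frames, so a dense sampler would have to wrap around the video. A simplified sketch of the sampling arithmetic (not mmaction2's actual SampleFrames implementation, which also handles random/centered offsets):

```python
import numpy as np

def sample_indices(total_frames, clip_len, frame_interval, start=0):
    """Simplified dense sampling in the spirit of SampleFrames:
    clip_len frames spaced frame_interval apart, wrapped onto the video."""
    raw = start + np.arange(clip_len) * frame_interval
    return raw, raw % total_frames

# A ~10 s Kinetics clip at 30 fps has about 300 frames.
raw, idx = sample_indices(total_frames=300, clip_len=32, frame_interval=16)
print(raw[-1])  # 496: the clip "wants" 497 frames, more than the video has
print(idx[-1])  # 196: the last index after wrapping back into the video
```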

  2. vitclip_large_k400: clip_len=16, frame_interval=16 for the train/val/test pipelines. However, if clip_len=16, shouldn't frame_interval be 8?

train_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=16, frame_interval=16, num_clips=1),
    ...
    dict(type='ToTensor', keys=['imgs', 'label'])
]

val_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=16,
        frame_interval=16,
        # ...
    ),
    ...
    dict(type='ToTensor', keys=['imgs'])
]

test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=16,
        frame_interval=16,
        # ...
    ),
    ...
]
Thank you.

Hi @BinhuiXie, thanks for your interest in our work. You can safely follow the settings described in the paper. I will update the code.