huggingface/pytorch-image-models

[BUG] CoAtNet_0 model differs from paper

karam-nus opened this issue · 1 comment

The CoAtNet_0 model defined in the paper has 5 repeating RelTransformer blocks in stage S3, whereas the timm implementation has 7.


We also see a difference in Top-1 for this model:

Top-1 in paper: 81.2
Top-1 reported on the HF model card: 82.39
Actual Top-1 on the IN-1k validation set: 78.87

Steps to reproduce the behavior:

  1. Get the model from HF/timm:

pt_model = timm.create_model('coatnet_0_rw_224.sw_in1k', pretrained=True)

  2. Validate on the imagenet-1k validation set.


The model accuracy reported in the HF documentation should therefore be 78.87, not 82.39.

@karam-nus I'm well aware; the models have rw in the name because they're my spin on the models. There are many comments and pointers in the code. This isn't the only difference.

depths=(2, 3, 7, 2), # deeper than paper '0' model
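Spelled out per stage (paper depths from the CoAtNet paper; timm depths from the config above):

```python
# per-stage block counts (S1, S2, S3, S4)
paper_coatnet_0 = (2, 3, 5, 2)    # CoAtNet-0 as defined in the paper
timm_coatnet_0_rw = (2, 3, 7, 2)  # timm 'rw' variant, deeper than paper

# the two extra blocks are all in stage S3
diff = [t - p for p, t in zip(paper_coatnet_0, timm_coatnet_0_rw)]
print(diff)  # -> [0, 0, 2, 0]
```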

def _rw_coat_cfg(
        stride_mode='pool',
        pool_type='avg2',
        conv_output_bias=False,
        conv_attn_early=False,
        conv_attn_act_layer='relu',
        conv_norm_layer='',
        transformer_shortcut_bias=True,
        transformer_norm_layer='layernorm2d',
        transformer_norm_layer_cl='layernorm',
        init_values=None,
        rel_pos_type='bias',
        rel_pos_dim=512,
):
    # 'RW' timm variant models were created and trained before seeing https://github.com/google-research/maxvit
    # Common differences for initial timm models:
    # - pre-norm layer in MbConv included an activation after norm
    # - mbconv expansion calculated from input instead of output chs
    # - mbconv shortcut and final 1x1 conv did not have a bias
    # - SE act layer was relu, not silu
    # - mbconv uses silu in timm, not gelu
    # - expansion in attention block done via output proj, not input proj
    # Variable differences (evolved over training initial models):
    # - avg pool with kernel_size=2 favoured downsampling (instead of maxpool for coat)
    # - SE attention was between conv2 and norm/act
    # - default to avg pool for mbconv downsample instead of 1x1 or dw conv
    # - transformer block shortcut has no bias
    return dict(
        conv_cfg=MaxxVitConvCfg(
            stride_mode=stride_mode,
            pool_type=pool_type,
            pre_norm_act=True,
            expand_output=False,
            output_bias=conv_output_bias,
            attn_early=conv_attn_early,
            attn_act_layer=conv_attn_act_layer,
            act_layer='silu',
            norm_layer=conv_norm_layer,
        ),
        transformer_cfg=MaxxVitTransformerCfg(
            expand_first=False,
            shortcut_bias=transformer_shortcut_bias,
            pool_type=pool_type,
            init_values=init_values,
            norm_layer=transformer_norm_layer,
            norm_layer_cl=transformer_norm_layer_cl,
            rel_pos_type=rel_pos_type,
            rel_pos_dim=rel_pos_dim,
        ),
    )

There are also more paper-like model configs, but I never trained any:

coatnet_0=MaxxVitCfg(
    embed_dim=(96, 192, 384, 768),
    depths=(2, 3, 5, 2),
    stem_width=64,
    head_hidden_size=768,
),

If you aren't within +/- 0.1-0.2 of the official eval results (https://github.com/huggingface/pytorch-image-models/blob/main/results/results-imagenet.csv), your eval is wrong.