huggingface/pytorch-image-models

[BUG] CoAtNet_0 model differs from paper

karam-nus opened this issue · 1 comment

The CoAtNet_0 model defined in the paper has 5 repeating RelTransformer blocks in stage S3, whereas the timm implementation has 7.


We also see a difference in Top-1 for this model:

Top-1 in paper: 81.2
Top-1 reported on the HF model card: 82.39
Actual Top-1 on the IN-1k validation set: 78.87

Steps to reproduce the behavior:

  1. Get the model from HF/timm:

pt_model = timm.create_model('coatnet_0_rw_224.sw_in1k', pretrained=True)

  2. Validate on the imagenet-1k validation set.


The model accuracy reported in the HF documentation should therefore be 78.87, not 82.39.

@karam-nus I'm well aware; the models have rw in the name because they're my spin on the models. There are many comments and pointers in the code. This isn't the only difference.

depths=(2, 3, 7, 2), # deeper than paper '0' model
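Spelled out per stage (paper depths from the CoAtNet paper; timm depths from the config above):

```python
# per-stage block counts (S1, S2, S3, S4)
paper_coatnet_0 = (2, 3, 5, 2)    # CoAtNet-0 as defined in the paper
timm_coatnet_0_rw = (2, 3, 7, 2)  # timm 'rw' variant, deeper than paper

# the two extra blocks are all in stage S3
diff = [t - p for p, t in zip(paper_coatnet_0, timm_coatnet_0_rw)]
print(diff)  # -> [0, 0, 2, 0]
```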

def _rw_coat_cfg(
        stride_mode='pool',
        pool_type='avg2',
        conv_output_bias=False,
        conv_attn_early=False,
        conv_attn_act_layer='relu',
        conv_norm_layer='',
        transformer_shortcut_bias=True,
        transformer_norm_layer='layernorm2d',
        transformer_norm_layer_cl='layernorm',
        init_values=None,
        rel_pos_type='bias',
        rel_pos_dim=512,
):
    # 'RW' timm variant models were created and trained before seeing https://github.com/google-research/maxvit
    # Common differences for initial timm models:
    # - pre-norm layer in MbConv included an activation after norm
    # - mbconv expansion calculated from input instead of output chs
    # - mbconv shortcut and final 1x1 conv did not have a bias
    # - SE act layer was relu, not silu
    # - mbconv uses silu in timm, not gelu
    # - expansion in attention block done via output proj, not input proj
    # Variable differences (evolved over training initial models):
    # - avg pool with kernel_size=2 favoured downsampling (instead of maxpool for coat)
    # - SE attention was between conv2 and norm/act
    # - default to avg pool for mbconv downsample instead of 1x1 or dw conv
    # - transformer block shortcut has no bias
    return dict(
        conv_cfg=MaxxVitConvCfg(
            stride_mode=stride_mode,
            pool_type=pool_type,
            pre_norm_act=True,
            expand_output=False,
            output_bias=conv_output_bias,
            attn_early=conv_attn_early,
            attn_act_layer=conv_attn_act_layer,
            act_layer='silu',
            norm_layer=conv_norm_layer,
        ),
        transformer_cfg=MaxxVitTransformerCfg(
            expand_first=False,
            shortcut_bias=transformer_shortcut_bias,
            pool_type=pool_type,
            init_values=init_values,
            norm_layer=transformer_norm_layer,
            norm_layer_cl=transformer_norm_layer_cl,
            rel_pos_type=rel_pos_type,
            rel_pos_dim=rel_pos_dim,
        ),
    )

There are also more paper-like model configs, but I never trained any:

coatnet_0=MaxxVitCfg(
    embed_dim=(96, 192, 384, 768),
    depths=(2, 3, 5, 2),
    stem_width=64,
    head_hidden_size=768,
),

If you aren't within +/- 0.1-0.2 of the official eval results (https://github.com/huggingface/pytorch-image-models/blob/main/results/results-imagenet.csv), your eval is wrong.