huggingface/pytorch-image-models

DINOv2 worse performance compared to the original version

davissf opened this issue · 5 comments

I've trained Vision Transformer (ViT) models, small and large, with DINOv2 pretrained weights from Facebook (vit_small_patch14_reg4_dinov2.lvd142m) and timm (dinov2_vits14_reg_lc). The timm version underperforms compared to Facebook's fine-tuned version, as seen in the feature and attention maps. Are there any known issues with timm's DINOv2 weights?

That said, timm's version is faster and uses less GPU VRAM than Facebook's. Which aspects of timm's implementation differ remains unclear to me.

I hope someone can explain these differences.
Thank you for the great work.

@davissf A lot more detail is required ... you also flipped the models: you're using a linear-classifier (LC) version from the original repo and the base pretrain for timm, but there are no LC versions in timm... so you should at the very least be comparing against dinov2_vits14_reg

Thank you for your response.
I found that the outputs of the two pretrained weights (from timm and FB) are identical. The model I used was dinov2_vits14_reg_lc, from which I used only the backbone (it's the same as the dinov2_vits14_reg weights, because the backbone was frozen while the FC layer was trained). Despite this, I got differing performance results between the two, even when all factors were held constant except for whether the backbone came from timm or Facebook.
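For reference, a minimal parity check between the two backbones might look like the sketch below (it assumes torch.hub access to facebookresearch/dinov2 and a recent timm; the resolution, token ordering, and tolerance are assumptions, not part of either library's documentation):

```python
import torch
import timm

# Official DINOv2 small w/ registers via torch.hub (the frozen backbone of the _lc variant)
fb_model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14_reg").eval()

# timm's port of the same pretrained weights
timm_model = timm.create_model("vit_small_patch14_reg4_dinov2.lvd142m", pretrained=True).eval()

# Random input at the lvd142m pretrain resolution (518 = 37 x 14px patches)
x = torch.randn(1, 3, 518, 518)
with torch.no_grad():
    fb_cls = fb_model(x)                         # DINOv2 backbone forward returns the CLS embedding
    timm_feats = timm_model.forward_features(x)  # (1, 1 cls + 4 reg + 1369 patch tokens, 384)
    timm_cls = timm_feats[:, 0]                  # CLS token assumed at index 0

print(torch.allclose(fb_cls, timm_cls, atol=1e-4))  # tolerance is a guess, not an exact-match claim
```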

Notably, the Facebook version consumes significantly more GPU RAM and takes longer to train than the timm backbone. What underlying implementation differences could lead to this?

@rwightman After fine-tuning on my own dataset, these are the feature maps generated with the Facebook backbone (left) and the timm backbone (right). The left one shows better attention to the hand skin lesions.
[image: feature maps, Facebook backbone (left) vs timm backbone (right)]

@davissf I added F.scaled_dot_product_attention support to the timm version; that is a noteworthy speedup / memory improvement. It shouldn't cause training differences, but there have been some regressions in that feature in PyTorch. It can be disabled by setting TIMM_FUSED_ATTN=0 in the env (should double check), or by commenting out that path in the code to verify.
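If it helps, a rough way to test with the fused path off could be something like this (the env var name is taken from the comment above and should be double-checked against the installed timm version):

```python
import os

# Set before importing timm so the flag is picked up at import time
# (name per the comment above; verify against your timm version).
os.environ["TIMM_FUSED_ATTN"] = "0"

import timm
import torch

model = timm.create_model("vit_small_patch14_reg4_dinov2.lvd142m", pretrained=True).eval()

x = torch.randn(1, 3, 518, 518)
with torch.no_grad():
    out = model.forward_features(x)
print(out.shape)  # expected (1, 1374, 384): 1 cls + 4 reg + 1369 patch tokens
```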

Are you sure the resolution is being handled the same for both? timm doesn't do dynamic interpolation by default, it uses a fixed resolution, though I'd guess that would error out if it were wrong, although padding might differ... I think the timm model needs dynamic_img_size and dynamic_img_pad set to True on model creation to match the dinov2 defaults, but I don't have the code open right now. Something like the sketch below.
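A sketch of what that might look like (the kwarg names are as mentioned above; the exact behavior should be confirmed against the installed timm version):

```python
import timm
import torch

# Create the timm backbone with dynamic sizing/padding enabled so non-native
# resolutions are interpolated/padded, closer to the original dinov2 behavior.
model = timm.create_model(
    "vit_small_patch14_reg4_dinov2.lvd142m",
    pretrained=True,
    dynamic_img_size=True,
    dynamic_img_pad=True,
).eval()

# A resolution whose width is not a multiple of the 14px patch size; with
# dynamic_img_pad the input should be padded out rather than raising an error.
x = torch.randn(1, 3, 448, 600)
with torch.no_grad():
    feats = model.forward_features(x)
print(feats.shape)
```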

@davissf Any more info? I won't be able to determine anything without example code and more details on the environment, versions, and resolutions being used... there are left/right border artifacts on the timm model's output, which possibly suggest different image handling or resizing.
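One quick thing to check on the resizing front: timm exposes the input size, interpolation, and normalization it expects through the model's pretrained config, so the preprocessing can be compared against whatever is being used with the original dinov2 backbone. A sketch, using the data-config helpers from recent timm releases:

```python
import timm
from timm.data import resolve_model_data_config, create_transform

model = timm.create_model("vit_small_patch14_reg4_dinov2.lvd142m", pretrained=True)

# Inspect the input size, interpolation, crop, and normalization timm expects
cfg = resolve_model_data_config(model)
print(cfg)

# Build the matching eval transform and compare it to the preprocessing
# applied in front of the Facebook backbone
transform = create_transform(**cfg, is_training=False)
print(transform)
```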