sayakpaul/probing-vits

Supervised training doesn't seem to help much with extracting salient representations?


In the DINO blog post, the authors show the following:

[Figure from the DINO blog post: original video (left), segmentation from a supervised model (middle), segmentation from DINO (right)]

This is what they say in the video caption:

> The original video is shown on the left. In the middle is a segmentation example generated by a supervised model, and on the right is one generated by DINO. (All examples are licensed from Stock.)

We see that the attention maps generated with the supervised pre-trained model aren't as salient as those generated with the DINO model.

This indeed seems to be the case:

Here's the Colab Notebook that verifies it. (Be aware that the notebook is not formatted.)
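For reference, here is a minimal sketch (not the notebook's exact code) of how the CLS-token attention maps can be pulled out of a DINO ViT via the official `facebookresearch/dino` torch.hub entry point; the supervised-model comparison would additionally require hooking into the attention layers of a supervised ViT, which is omitted here. `example.jpg` is a placeholder image path.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

PATCH_SIZE = 8  # dino_vits8 uses 8x8 patches

# DINO ViT-S/8; `get_last_selfattention` is provided by the DINO repo's model class.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits8")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((480, 480)),  # must be divisible by the patch size
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

# Placeholder image path for illustration.
img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # (1, 3, H, W)

with torch.no_grad():
    # Shape: (1, num_heads, num_tokens, num_tokens); token 0 is the [CLS] token.
    attn = model.get_last_selfattention(img)

num_heads = attn.shape[1]
h_feat = img.shape[2] // PATCH_SIZE
w_feat = img.shape[3] // PATCH_SIZE

# Attention from [CLS] to every patch token, reshaped into a 2D map per head.
cls_attn = attn[0, :, 0, 1:].reshape(num_heads, h_feat, w_feat)

# Upsample back to image resolution for visualization / thresholding.
cls_attn = F.interpolate(
    cls_attn.unsqueeze(0), scale_factor=PATCH_SIZE, mode="nearest"
)[0]
print(cls_attn.shape)  # (num_heads, H, W)
```

Visualizing (or thresholding) these per-head maps is what gives the segmentation-like figures shown above; running the same procedure on a supervised ViT is what produces the less salient maps discussed in this issue.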