mcahny/rovit

[ROVIT] How to generate Figure 3 in the paper "Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers"

Closed this issue · 1 comment

I am confused about how to generate Figure 3 in the paper "Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers". The figure shows the brightness patterns of the learned positional embeddings for the ViT-B/16 backbone.

According to my understanding, each tile in the figure represents the cosine similarity between the positional embedding of one patch and the positional embeddings of the other patches. Since the cosine similarity between two vectors is a single scalar, I would expect each tile to contain a single value. However, in the figure each tile contains many values.

Why does each tile contain multiple values? Is it because the cosine similarity between two vectors is a vector rather than a scalar? If so, how should I interpret the values within each tile?

Can you please provide the code for generating Figure 3 in the paper?

Closing with an explanation and the code below.
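Each tile is not a single pairwise similarity. For a ViT-B/16 backbone at 224×224 input there is a 14×14 grid of patch positional embeddings (plus one for the CLS token). Tile (i, j) shows the cosine similarities between the positional embedding at grid position (i, j) and the positional embeddings at all 196 grid positions, reshaped back onto the 14×14 grid. Each tile is therefore itself a small 14×14 heat map, which is why it contains multiple values. The cosine similarity between any two vectors is still a single scalar; the many values in a tile come from comparing one embedding against every position, not from the similarity being a vector.

Below is a minimal sketch of this visualization, not the authors' original plotting script. It assumes a standard ViT-B/16 checkpoint loaded through timm (the model name `vit_base_patch16_224` and the `pos_embed` attribute follow the timm API); to reproduce the paper's figure exactly you would load the RO-ViT backbone weights instead.

```python
import matplotlib.pyplot as plt
import timm
import torch.nn.functional as F

# Assumption: a standard timm ViT-B/16; the paper's figure would use the RO-ViT backbone.
model = timm.create_model("vit_base_patch16_224", pretrained=True)

pos_embed = model.pos_embed.detach()[0]  # (197, 768): CLS token + 14x14 patch embeddings
pos_embed = pos_embed[1:]                # drop the CLS embedding -> (196, 768)

grid = 14  # 224 / 16 = 14 patches per side

# Pairwise cosine similarity between all positional embeddings -> (196, 196).
sim = F.cosine_similarity(pos_embed.unsqueeze(1), pos_embed.unsqueeze(0), dim=-1)

fig, axes = plt.subplots(grid, grid, figsize=(10, 10))
for i in range(grid):
    for j in range(grid):
        idx = i * grid + j
        # Tile (i, j): similarity of embedding (i, j) to all 196 embeddings,
        # reshaped back onto the 14x14 patch grid.
        axes[i, j].imshow(sim[idx].reshape(grid, grid).numpy(), vmin=-1, vmax=1)
        axes[i, j].axis("off")
plt.tight_layout()
plt.show()
```

Each of the 14×14 subplots then shows one tile: bright regions mark the grid positions whose positional embeddings are most similar to the embedding at that tile's own position.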