facebookresearch/detr

In the paper, it is mentioned that visualizing the last layer of attention graph, how is this operation done?

notfacezhi opened this issue · 4 comments

[figure: attention map visualization from the paper]
I don't understand what the points in this figure represent, or how the attention map associated with a point is visualized. In self-attention, the input of shape (b, c, h, w) is reshaped to (b, h * w, c), so the attention map has shape (h * w, h * w). How is this visualized on the original image?

fmassa commented

Hi,

In https://github.com/facebookresearch/detr#notebooks the first notebook has the code to visualize the images that we used in the paper, including the attention matrix.

Each point in the original image corresponds to a row (or column) of the attention matrix; that row has length h * w, so it can be reshaped back into an (h, w) image.
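To make the reshaping step concrete, here is a minimal sketch (the shapes h, w and the query point are made-up illustration values, not from the repo):

```python
import torch

# Hypothetical feature-map size from the CNN backbone (illustration only).
h, w = 25, 34

# A random (h*w, h*w) self-attention matrix; softmax makes each row a
# distribution over all h*w positions, as in real attention weights.
attn = torch.rand(h * w, h * w).softmax(dim=-1)

# Pick one query point (y, x) on the feature map. Its row of attention
# weights over all h*w positions reshapes back into an (h, w) image,
# which can then be upsampled and overlaid on the original picture.
y, x = 12, 20
attn_map = attn[y * w + x].reshape(h, w)
print(attn_map.shape)  # torch.Size([25, 34])
```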

I believe I've answered your question, so I'm closing this issue.

Hey @fmassa, thanks for the great DETR work! I've been trying to replicate some of the paper's illustrations.

I'd expect the self-attention weights to come from the operation `attn = (q * scale) @ k.T` that weights the values. However, looking at the Transformer class definitions in the detr repo (https://github.com/facebookresearch/detr/blob/main/models/transformer.py#L127), the forward pass only returns the final tensor of shape (b, h * w, c).

I don't understand how the Colab notebook obtains the hook's output. Is there any other code that the Colab model uses?
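For reference, the scaled dot-product operation mentioned above can be sketched like this (single head, single batch element, with made-up dimensions; the projection weights are random stand-ins, not the model's):

```python
import torch

hw, c = 100, 256                       # h*w flattened positions, channel dim
x = torch.rand(hw, c)                  # flattened feature map
wq, wk, wv = (torch.rand(c, c) for _ in range(3))  # stand-in projections

q, k, v = x @ wq, x @ wk, x @ wv
scale = c ** -0.5

# attn = (q * scale) @ k.T, softmaxed per row: the (h*w, h*w) attention map.
attn = ((q * scale) @ k.T).softmax(dim=-1)
out = attn @ v                         # (h*w, c) attention-weighted values
print(attn.shape, out.shape)
```

It is this intermediate `attn` tensor, not `out`, that the visualization needs, which is why a hook is required.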

MLDeS commented

> Hey @fmassa, thanks for the great DETR work! I've been trying to replicate some of the paper's illustrations.
>
> I'd expect the self-attention weights to come from the operation `attn = (q * scale) @ k.T` that weights the values. However, looking at the Transformer class definitions in the detr repo (https://github.com/facebookresearch/detr/blob/main/models/transformer.py#L127), the forward pass only returns the final tensor of shape (b, h * w, c).
>
> I don't understand how the Colab notebook obtains the hook's output. Is there any other code that the Colab model uses?

Did you figure this out?

@MLDeS I just used the model straight from `detr = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)`. Then a hook on `detr.transformer.encoder.layers[-1].self_attn` captures two outputs: one is the feature map and the other is the attention map.
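The hook approach described above can be sketched as follows. To keep the example self-contained, a bare `nn.MultiheadAttention` stands in for `detr.transformer.encoder.layers[-1].self_attn` (loading the real model needs `torch.hub` and network access); the dimensions are illustrative:

```python
import torch
import torch.nn as nn

attn_maps = []

def hook(module, inputs, outputs):
    # nn.MultiheadAttention's forward returns (attn_output, attn_weights);
    # with need_weights=True (the default), attn_weights has shape
    # (b, h*w, h*w), averaged over heads.
    attn_maps.append(outputs[1])

# Stand-in for the encoder's last self-attention layer.
b, hw, d_model = 1, 100, 256
self_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
self_attn.register_forward_hook(hook)

x = torch.rand(b, hw, d_model)   # flattened (b, h*w, c) features
self_attn(x, x, x)               # self-attention forward pass fires the hook
print(attn_maps[0].shape)        # torch.Size([1, 100, 100])
```

With the real model, registering the same hook on `detr.transformer.encoder.layers[-1].self_attn` and running a normal forward pass captures the encoder's last-layer attention map, each row of which can be reshaped to (h, w) as described earlier in the thread.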