facebookresearch/detr

In the paper, it is mentioned that visualizing the last layer of attention graph, how is this operation done?

notfacezhi opened this issue · 4 comments

[figure: attention map visualization from the paper]
I don't understand what the points in this figure represent, or how the attention map associated with a point is visualized. In self-attention, the input of shape (b, c, h, w) is reshaped to (b, h * w, c), so the attention map has shape (h * w, h * w). How is this visualized on the original image?

fmassa commented

Hi,

In https://github.com/facebookresearch/detr#notebooks the first notebook has the code to visualize the images that we used in the paper, including the attention matrix.

Each point in the original image corresponds to a row (or column) of the attention matrix; that row has length h * w, so it can be reshaped back into an (h, w) image.
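To make the reshaping step concrete, here is a minimal sketch (the shapes h, w and the query point are made-up illustration values, not from the repo):

```python
import torch

# Hypothetical feature-map size from the CNN backbone (illustration only).
h, w = 25, 34

# A random (h*w, h*w) self-attention matrix; softmax makes each row a
# distribution over all h*w positions, as in real attention weights.
attn = torch.rand(h * w, h * w).softmax(dim=-1)

# Pick one query point (y, x) on the feature map. Its row of attention
# weights over all h*w positions reshapes back into an (h, w) image,
# which can then be upsampled and overlaid on the original picture.
y, x = 12, 20
attn_map = attn[y * w + x].reshape(h, w)
print(attn_map.shape)  # torch.Size([25, 34])
```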

I believe I've answered your question, so I'm closing this issue.

Hey @fmassa, thanks for the great DETR work! I've been trying to replicate some of the paper's illustrations.

I'd expect the self-attention weights to come from the operation `attn = (q * scale) @ k.T` that weights the values. However, looking at the Transformer class definitions in the detr repo (https://github.com/facebookresearch/detr/blob/main/models/transformer.py#L127), the forward pass only returns the final tensor of shape (b, h * w, c).

I don't understand how the Colab notebook obtains the hook's output. Is there any other code that the Colab model uses?
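For reference, the scaled dot-product operation mentioned above can be sketched like this (single head, single batch element, with made-up dimensions; the projection weights are random stand-ins, not the model's):

```python
import torch

hw, c = 100, 256                       # h*w flattened positions, channel dim
x = torch.rand(hw, c)                  # flattened feature map
wq, wk, wv = (torch.rand(c, c) for _ in range(3))  # stand-in projections

q, k, v = x @ wq, x @ wk, x @ wv
scale = c ** -0.5

# attn = (q * scale) @ k.T, softmaxed per row: the (h*w, h*w) attention map.
attn = ((q * scale) @ k.T).softmax(dim=-1)
out = attn @ v                         # (h*w, c) attention-weighted values
print(attn.shape, out.shape)
```

It is this intermediate `attn` tensor, not `out`, that the visualization needs, which is why a hook is required.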

MLDeS commented

> Hey @fmassa, thanks for the great DETR work! I've been trying to replicate some of the paper's illustrations.
>
> I'd expect the self-attention weights to come from the operation `attn = (q * scale) @ k.T` that weights the values. However, looking at the Transformer class definitions in the detr repo (https://github.com/facebookresearch/detr/blob/main/models/transformer.py#L127), the forward pass only returns the final tensor of shape (b, h * w, c).
>
> I don't understand how the Colab notebook obtains the hook's output. Is there any other code that the Colab model uses?

Did you figure this out?

@MLDeS I just used the model straight from `detr = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)`. Then a hook on `detr.transformer.encoder.layers[-1].self_attn` captures two outputs: one is the feature map and the other is the attention map.
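The hook approach described above can be sketched as follows. To keep the example self-contained, a bare `nn.MultiheadAttention` stands in for `detr.transformer.encoder.layers[-1].self_attn` (loading the real model needs `torch.hub` and network access); the dimensions are illustrative:

```python
import torch
import torch.nn as nn

attn_maps = []

def hook(module, inputs, outputs):
    # nn.MultiheadAttention's forward returns (attn_output, attn_weights);
    # with need_weights=True (the default), attn_weights has shape
    # (b, h*w, h*w), averaged over heads.
    attn_maps.append(outputs[1])

# Stand-in for the encoder's last self-attention layer.
b, hw, d_model = 1, 100, 256
self_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
self_attn.register_forward_hook(hook)

x = torch.rand(b, hw, d_model)   # flattened (b, h*w, c) features
self_attn(x, x, x)               # self-attention forward pass fires the hook
print(attn_maps[0].shape)        # torch.Size([1, 100, 100])
```

With the real model, registering the same hook on `detr.transformer.encoder.layers[-1].self_attn` and running a normal forward pass captures the encoder's last-layer attention map, each row of which can be reshaped to (h, w) as described earlier in the thread.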