vasgaowei/TS-CAM

TransAttention performance

liruiwen opened this issue · 1 comment

Thanks for the great work.

I have a question about how to reproduce the TransAttention performance on CUB-200 (Table 5). I got much higher performance by changing

cams = cams * feature_map # B * C * 14 * 14

to

cams = cams.repeat((1, 200, 1, 1))

and got:

Cls@1:0.803 Cls@5:0.948
Loc@1:0.690 Loc@5:0.816 Loc_gt:0.859
wrong_details:3998 1139 0 556 96 5
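
For reference, this is roughly what the change amounts to; the tensor names and shapes below are my own sketch of the relevant step, not the exact code from the TS-CAM repository:

```python
import torch

# Sketch of the two variants being compared (shapes are my assumption).
B, C, H, W = 2, 200, 14, 14                # batch, CUB-200 classes, token grid
attn_map = torch.rand(B, 1, H, W)          # semantic-agnostic attention map
feature_map = torch.rand(B, C, H, W)       # class-aware semantic feature map

# TS-CAM coupling: the attention map modulates the class-aware feature maps.
cams_tscam = attn_map * feature_map        # B x C x 14 x 14

# The change in my experiment: drop the feature maps and reuse the
# attention map for every class.
cams_attn_only = attn_map.repeat((1, C, 1, 1))  # B x C x 14 x 14
```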

Also, I got Loc@1:0.154 Loc@5:0.177 Loc_gt:0.183 for TransCAM, so there seems to be a mistake in the table. Personally, I feel it is unfair to compare against TransCAM without tuning CAM_THR to its optimum: I can get Loc@1:0.333 Loc@5:0.379 Loc_gt:0.387 by setting CAM_THR to around 0.8. I wonder what your thoughts are here.
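
For context on why CAM_THR matters so much here, the box-extraction step in WSOL evaluation usually looks roughly like the sketch below (cam_to_bbox is a hypothetical helper, not the repository's evaluation code). A higher threshold keeps a smaller foreground region and therefore a tighter box, which makes a large difference for a diffuse map like TransCAM's.

```python
import numpy as np

def cam_to_bbox(cam, cam_thr=0.1):
    """Sketch: turn a 2D class activation map into one bounding box."""
    # Normalize to [0, 1] so the threshold is a fraction of the peak response.
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    ys, xs = np.nonzero(cam >= cam_thr)    # foreground pixels above the cutoff
    if xs.size == 0:
        return None
    # Tight box (x1, y1, x2, y2) around the foreground region.
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```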

Hi, thanks for your attention and your comments. Here are my answers.

  1. You got much higher performance by changing cams = cams * feature_map to cams = cams.repeat((1, 200, 1, 1)). I will try to answer this from two perspectives.
  • Firstly, the localization accuracies you report are still lower than those of TS-CAM.
  • Secondly, there are two implementations of TransAttention. In the first, we simply fine-tune DeiT-S on the CUB-200-2011 dataset, predict the image classification logits from the Class Token, and compute the cross-entropy loss between these logits and the ground-truth labels. In the second, the classification logits are obtained by average-pooling all the Patch Tokens, excluding the Class Token (see the sketch after this list). The results in Table 5 of our paper come from the first implementation, while your result corresponds to the second.
  2. As for your second question: you obtained higher TransCAM performance with a higher CAM_THR. However, the localization accuracies still lag well behind TS-CAM, and also behind the CNN-CAM method, so this does not affect the conclusion. Also, for a fair comparison, we keep CAM_THR the same for all methods.
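
To make the distinction in point 1 concrete, here is a minimal sketch of the two classification heads, assuming a DeiT-S-style token layout with the class token at index 0 and 384-dimensional embeddings; the ClsHead module and the use_class_token flag are illustrative names, not the repository's code.

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Sketch of the two TransAttention classification heads described above."""

    def __init__(self, embed_dim=384, num_classes=200, use_class_token=True):
        super().__init__()
        self.use_class_token = use_class_token
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):
        # tokens: B x (1 + N) x D, with the class token at index 0.
        if self.use_class_token:
            feat = tokens[:, 0]               # first implementation (Table 5)
        else:
            feat = tokens[:, 1:].mean(dim=1)  # second: average patch tokens only
        return self.fc(feat)

# Example: B=2 images, 14x14 = 196 patch tokens plus the class token.
tokens = torch.rand(2, 197, 384)
logits_v1 = ClsHead(use_class_token=True)(tokens)    # class-token logits
logits_v2 = ClsHead(use_class_token=False)(tokens)   # patch-token-pooled logits
```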