TransAttention performance
liruiwen opened this issue · 1 comment
liruiwen commented
Thanks for the great work.
I have a question about how to reproduce the TransAttention performance on CUB-200 (Table 5). I got much higher performance by changing Line 62 in aeb823e to `cams = cams.repeat((1, 200, 1, 1))` (a sketch of the change follows the numbers below):
Cls@1:0.803 Cls@5:0.948
Loc@1:0.690 Loc@5:0.816 Loc_gt:0.859
wrong_details:3998 1139 0 556 96 5
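For context, here is a minimal sketch of the two variants of that line; the shapes are my assumption (a single class-agnostic attention map against 200 class-specific semantic maps), inferred from the snippets rather than copied from the repo:

```python
import torch

B, H, W, NUM_CLASSES = 2, 14, 14, 200  # toy sizes; CUB-200 has 200 classes

# Assumed shapes: a class-agnostic attention map ("cams") and
# 200 class-specific semantic maps ("feature_map").
cams = torch.rand(B, 1, H, W)
feature_map = torch.rand(B, NUM_CLASSES, H, W)

# Original Line 62 (TS-CAM): couple the attention map with the semantic
# maps; the single attention channel broadcasts over all 200 class channels.
tscam_maps = cams * feature_map                        # (B, 200, H, W)

# The change above (TransAttention): drop the semantic maps and tile the
# same attention map for every class.
transattn_maps = cams.repeat((1, NUM_CLASSES, 1, 1))   # (B, 200, H, W)
```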
Also, I got Loc@1:0.154 Loc@5:0.177 Loc_gt:0.183 for TransCAM, so it seems there's a mistake in the table. Personally, I feel it's unfair to compare against TransCAM without tuning CAM_THR to its optimum: I can get Loc@1:0.333 Loc@5:0.379 Loc_gt:0.387 by setting CAM_THR to around 0.8. I'd like to hear your thoughts on this.
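For reference, a minimal sketch of how I understand CAM_THR to be applied when turning a CAM into a box (normalize, binarize at the threshold, box the foreground); this is a simplification with a hypothetical helper name, since the actual evaluation code typically boxes only the largest connected component:

```python
import numpy as np

def cam_to_bbox(cam: np.ndarray, cam_thr: float):
    """Binarize a CAM at cam_thr and return the tightest (x1, y1, x2, y2)
    box around the above-threshold pixels."""
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # scale to [0, 1]
    fg = cam >= cam_thr
    if not fg.any():
        h, w = cam.shape
        return 0, 0, w - 1, h - 1  # fall back to the full image
    ys, xs = np.where(fg)
    return xs.min(), ys.min(), xs.max(), ys.max()
```

A higher CAM_THR keeps only the strongest activations and shrinks the box, which is why a diffuse map can benefit from a larger threshold.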
vasgaowei commented
Hi, thanks for your interest and your careful review. Here are my answers.
- You got a much higher performance by changing `cams = cams * feature_map` to `cams = cams.repeat((1, 200, 1, 1))`. I'll try to answer your question from two angles:
- Firstly, the localization accuracies are still lower than those of TS-CAM.
- Secondly, there are two implementations of TransAttention. In the first, we simply fine-tune DeiT-S on the CUB-200-2011 dataset, predict the image classification logits from the class token, and compute the cross-entropy loss between the logits and the ground-truth labels. In the second, we obtain the classification logits by average-pooling all the patch tokens, excluding the class token. The results in Table 5 of our paper come from the first implementation, while your result comes from the second (see the sketch after this list).
- As for your second question: you got higher performance for TransCAM with a higher CAM_THR. However, the localization accuracies are still much lower than those of TS-CAM, and they also lag behind the CNN-based CAM method, so this doesn't affect the conclusion. Also, for a fair comparison, we tried to keep CAM_THR the same across methods.
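For concreteness, a minimal sketch of the two heads described above, assuming a standard DeiT-style token layout of shape (B, 1 + N, D) with the class token at index 0; the module and its names are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Classification logits from transformer tokens, two ways."""
    def __init__(self, dim: int = 384, num_classes: int = 200,
                 use_class_token: bool = True):
        super().__init__()
        self.use_class_token = use_class_token
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 1 + N, D), class token first
        if self.use_class_token:
            feat = tokens[:, 0]               # first implementation (Table 5)
        else:
            feat = tokens[:, 1:].mean(dim=1)  # second: average-pool patch tokens
        return self.fc(feat)  # trained with cross-entropy against GT labels
```

The default dim of 384 matches DeiT-S's embedding dimension.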