hila-chefer/Transformer-Explainability

Questions regarding the evaluation process

zmy1116 opened this issue · 2 comments

Hello,

First, thank you for this great work. It definitely pushes XAI for transformers to a new frontier, and it has inspired many people (including me) to work in this area!

I have some questions regarding the evaluation process, and I would appreciate your input:

  1. To evaluate segmentation performance on ImageNet data, we use the annotated segmentations from Guillaumin et al. 2014, which cover 4,276 images from 445 ImageNet classes. One issue I see is that not all of these 445 classes are part of the standard 1,000-class ImageNet dataset (the one used to train ViT or DeiT). There is some overlap, but many classes are hypernyms, hyponyms, or simply different.

From what I observe, many "errors" in the segmentation occur simply because the model misclassified the image, so the explanation is shown for the predicted class instead of the ground-truth class. In these cases the explainability error is caused by the model itself rather than by the explainability method. Do you think it would be useful to separate out the cases where the model prediction is incorrect in the first place, so we can evaluate the XAI method in isolation?

I guess this question also applies to the other evaluation tests, like positive/negative perturbation. For the segmentation task we don't have matching label classes, so I understand it is difficult to separate out the cases where the model made a mistake. However, for the other tests (including the ones on NLP tasks), we do have the classification labels. Is there a reason misclassified cases aren't treated separately? (A rough sketch of the filtering I have in mind is below.)
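For concreteness, this is a minimal sketch of the kind of filtering I mean; the `model`, `images`, and `labels` names are just placeholders, not your evaluation code:

```python
import torch

@torch.no_grad()
def keep_correctly_classified(model, images, labels):
    """Return a boolean mask selecting only samples the model classifies
    correctly, so the explanation can be evaluated separately from model errors."""
    logits = model(images)        # (B, num_classes)
    preds = logits.argmax(dim=1)  # predicted class per sample
    return preds == labels

# Hypothetical usage inside an evaluation loop:
# mask = keep_correctly_classified(model, images, labels)
# images, labels = images[mask], labels[mask]
# ... then run the segmentation / perturbation metrics as usual
```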

  2. For the segmentation evaluation, I have a question about the code that calculates average precision:

We first concatenate (1 - Res) and Res to form a tensor output_AP of size 1x2x224x224:

output_AP = torch.cat((Res_0_AP, Res_1_AP), 1)

We then compute AP using output_AP and labels (of shape 1x224x224)

ap = np.nan_to_num(get_ap_scores(output_AP, labels))

My question is about get_ap_scores:

We iterate over every record; since the batch size is 1, each pred has shape 2x224x224 and each tgt has shape 224x224:

for pred, tgt in zip(predict, target):

At line 91, pred is flattened to size 224x224 + 224x224, so the first 224x224 entries are 1 - Res and the second 224x224 entries are Res,

and we compare it against a target of size 224x224 + 224x224, where the first 224x224 entries are 1 - tgt and the second 224x224 entries are tgt:

predict_flat = pred.data.cpu().numpy().reshape(-1)

target_flat = target_1hot.data.cpu().numpy().reshape(-1)

and we apply average_precision_score to the flattened arrays:

total.append(np.nan_to_num(average_precision_score(t, p)))

I think this may not be correct. It's like having prediction [0.1, 0.2, 0.5, 0.3] and target [0, 0, 1, 1], then appending the complement of each to the two lists and computing AP on [0.1, 0.2, 0.5, 0.3, 0.9, 0.8, 0.5, 0.7] and [0, 0, 1, 1, 1, 1, 0, 0]...
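For concreteness, here is a minimal sketch of that toy example using sklearn's average_precision_score (the numbers are just the illustrative values above, not the repo's actual tensors):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy example from above: the relevance map flattened, and the binary ground truth.
res = np.array([0.1, 0.2, 0.5, 0.3])
tgt = np.array([0, 0, 1, 1])

# AP on the relevance map alone.
ap_plain = average_precision_score(tgt, res)

# AP after appending the complements, mimicking the flattened
# [Res, 1 - Res] prediction and [tgt, 1 - tgt] one-hot target.
pred_cat = np.concatenate([res, 1 - res])
tgt_cat = np.concatenate([tgt, 1 - tgt])
ap_cat = average_precision_score(tgt_cat, pred_cat)

print(ap_plain, ap_cat)  # the two values differ in general
```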

Maybe the problem is that I got the input tensor shape wrong? I've seen several papers use your evaluation code to produce results. For example, for the following work from last year's NeurIPS, I can confirm that they use exactly the input tensor format I described above, because I am able to reproduce their results perfectly (and the AP looks very wrong):
https://github.com/XianrenYty/Transition_Attention_Maps

Sorry for the long post. I would appreciate it if you could take a look. I've seen multiple XAI works using this exact evaluation code, and I'm trying to compare against them in my own research.

Thank you

Hi @zmy1116, thanks for your interest and all your kind words! I’m so glad to hear you found our work inspiring!

I’ll do my best to answer your questions; feel free to ask for clarifications.

  1. Personally, segmentation tests are not my favorite type of test for explanations of image-classification models. The reason is that these tests examine whether the explanations resemble human explanations (i.e., segmentation), which does not necessarily guarantee a faithful explanation; in my opinion, the evaluation should generally measure how faithfully the explanation reflects the model’s logic. However, this test has been widely used in previous works, and the baselines tend to highlight unrelated pixels, so we opted to use it as it was presented in previous works. For perturbation, we had the luxury of examining cases where the model was wrong by using the ground-truth class as well as the predicted class for relevance propagation (see the sketch after this list).
  2. I’m not really sure I follow your logic, but we used the RAP implementation for our segmentation tests.
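For concreteness, a rough sketch of what I mean by using both classes; compute_relevance and perturbation_test here are just placeholder names, not the functions in this repo:

```python
import torch

def evaluate_both_classes(model, image, target_class,
                          compute_relevance, perturbation_test):
    """Run a perturbation test twice: once with relevance propagated for the
    predicted class and once for the ground-truth class (hypothetical sketch)."""
    with torch.no_grad():
        pred_class = model(image).argmax(dim=1)

    # Relevance maps for the two class choices.
    rel_pred = compute_relevance(model, image, class_index=pred_class)
    rel_gt = compute_relevance(model, image, class_index=target_class)

    # Positive/negative perturbation scores for each map.
    return {
        "predicted": perturbation_test(model, image, rel_pred),
        "target": perturbation_test(model, image, rel_gt),
    }
```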

I hope this helps clarify things a bit more.
Additionally, I’d greatly appreciate it if you would consider our second work as well when comparing with current baselines :) It was published at ICCV ’21.

Best,
Hila.

@zmy1116 closing this issue due to inactivity, please reopen if necessary :)