[CV_Localization] Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Abstract
- visual explanations for decisions from CNN-based models -> more transparent and explainable
- uses gradients flowing into the final conv layer to produce a coarse localization map -> highlights regions in the image that are important for predicting the concept
- applicable to many CNN tasks without architectural changes or re-training
- Classification
- lends insight into failure modes via reasonable explanations
- outperform previous methods
- robust to adversarial perturbations
- more faithful to the underlying model
- helps identify dataset bias -> better generalization (for fair and bias-free outcomes)
- Localization
- image captioning, VQA
- even non-attention based models
- Human study
- appropriate trust in prediction from deep networks
- discern a stronger from a weaker model even when both make identical predictions
1. Introduction
- Transparent models : should explain why they predict what they predict
- AI evolution : (VQA) Identify failure / (classification) Establish appropriate trust / (chess) Teach human how to make better decisions
- Trade-off bw accuracy and interpretability (simplicity)
- Classical model : interpretability ↑, accuracy ↓
- Deep model : interpretability ↓, accuracy ↑ BY greater abstraction (layers↑) and integration (end-to-end training)
- CAM vs Grad-CAM
- CAM : constrained to model architecture (GAP -> fc)
- Grad-CAM : deep models without altering architecture (no trade-off) => Generalization of CAM
- Guided Grad-CAM : class-discriminative & high-resolution = good visual explanation
- CAM, Grad-CAM : class-discriminative (localize)
- Guided backprop, Deconv : high-resolution (detail)
2. Related Work
- Visualizing CNNs
- Assessing model trust
- Aligning gradient-based importance
- Weakly-supervised localization : training without bbox information
3. Grad-CAM
- last conv layer : high-level semantics (class-specific) & detailed spatial information
- gradient flowing into last conv -> assign importance values to each neuron for a particular decision of interest
① Class score (before softmax) : y^c (could be any differentiable activation)
② Gradients of y^c wrt feature map activations A^k via backprop : ∂y^c/∂A^k
③ Global average pooling -> Importance weight of feature map k for target class c : a_k^c
④ Weighted combination of forward activation maps
⑤ Apply ReLU b/c only interested in features of positive influence
-> result : coarse heatmap of same size as conv feature maps
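Putting ②-⑤ together, the two equations from the paper (Z = number of spatial locations in the feature map):
a_k^c = (1/Z) * Σ_i Σ_j ∂y^c / ∂A_ij^k (global-average-pooled gradients)
L^c_Grad-CAM = ReLU( Σ_k a_k^c * A^k ) (weighted combination of feature maps, then ReLU)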
3.1 Grad-CAM generalizes CAM
- Mathematical proof : for architectures ending in GAP -> fc (as CAM requires), the Grad-CAM weights a_k^c reduce, up to a proportionality constant, to the CAM class weights w_k^c, so CAM is a special case of Grad-CAM
3.2 Guided Grad-CAM
- Grad-CAM : pixel-space detail ↓ -> unclear why network predicts particular instance
- Guided Backprop : suppresses negative gradients when backpropagating through ReLUs -> captures pixel-level details detected by neurons
- Guided Grad-CAM : combine by element-wise multiplication (after upsampling Grad-CAM to the input resolution; see the sketch below) -> both high-resolution & class-discriminative + less noisy than deconv
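A minimal sketch of that fusion, assuming a guided-backprop saliency map gb_map of shape (H, W, 3) and a coarse Grad-CAM map grad_cam (e.g. from generate_gradcam in the Code Review below); the only extra step is upsampling before the element-wise product:

import cv2
import numpy as np

def guided_gradcam(gb_map, grad_cam):
    # Upsample the coarse Grad-CAM map to the input resolution (bilinear)
    cam = cv2.resize(grad_cam, (gb_map.shape[1], gb_map.shape[0]))
    cam = np.maximum(cam, 0) / (cam.max() + 1e-8)      # normalize to [0, 1]
    # Element-wise product: class-discriminative mask x high-resolution detail
    return gb_map * cam[..., np.newaxis]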
3.3 Counterfactual Explanations
- Which regions hurt the prediction the most? -> negate the gradients of y^c wrt A^k to highlight them (sketch below)
- Shows the DL model bases its decision on the foreground, not the background
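A minimal sketch of the counterfactual variant, reusing the conv_output / grad_val arrays computed inside generate_gradcam below; the only change from Grad-CAM is the minus sign on the gradients:

import numpy as np

def counterfactual_cam(conv_output, grad_val):
    # conv_output: A^k feature maps, shape (H, W, K); grad_val: dy^c/dA^k, same shape
    weights = np.mean(-grad_val, axis=(0, 1))      # negated importance weights
    cam = np.maximum(conv_output @ weights, 0)     # weighted sum over k + ReLU
    return cam                                     # highlights regions that lower y^c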
4. Evaluating Localization Ability of Grad-CAM
4.1 Weakly-supervised Localization
- Weakly-supervised Localization : training without bbox information
- Given an image -> obtain class predictions -> generate Grad-CAM maps for each predicted class -> binarize pixels with a threshold of 15% of max intensity -> draw a bbox around the single largest segment (see the sketch after this list)
- Grad-CAM localization error < others
- No change to the model structure and no re-training -> no compromise on classification performance!
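A minimal sketch of that thresholding pipeline, assuming cam is a Grad-CAM map already resized to the image resolution (the connected-component step uses scipy.ndimage; the 0.15 threshold follows the paper):

import numpy as np
from scipy import ndimage

def gradcam_to_bbox(cam, thresh_ratio=0.15):
    # Binarize at 15% of the max intensity
    mask = cam >= thresh_ratio * cam.max()
    # Keep only the single largest connected segment
    labeled, n = ndimage.label(mask)
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labeled, index=range(1, n + 1))
    largest = labeled == (np.argmax(sizes) + 1)
    # Tight bounding box around that segment: (x_min, y_min, x_max, y_max)
    ys, xs = np.where(largest)
    return xs.min(), ys.min(), xs.max(), ys.max()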
4.2 Weakly-supervised Segmentation
- Semantic Segmentation : assign each pixel in image an object class -> expensive pixel-level annotation
- Weakly-supervised Segmentation : segment object with image-level annotation -> cheap and easy to get data
- SEC with CAM : sensitive to the choice of the weak localization seed -> seeding SEC with Grad-CAM improves IoU from 44.6 to 49.6
4.3 Pointing Game
- Why : To evaluate discriminativeness of visualization method for localizing objects
- How : Extract the maximally activated point on the generated heatmap -> check whether it falls on the target-label object -> count Hits and Misses (see the sketch after this list)
Acc = # Hits / (# Hits + # Misses) ... this only measures precision
- For recall, compute localization maps for the top-5 class predictions -> evaluate them with an additional option
option : reject heatmaps below a threshold for classes absent from the GT
- Result : Grad-CAM > c-MWP (70.58% > 60.30%)
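A minimal sketch of one hit/miss decision, assuming cam is the heatmap at image resolution and gt_mask is a boolean ground-truth mask for the target category:

import numpy as np

def pointing_game_hit(cam, gt_mask):
    # Maximally activated point on the heatmap
    y, x = np.unravel_index(np.argmax(cam), cam.shape)
    # Hit if that point falls inside the ground-truth object region
    return bool(gt_mask[y, x])

# Acc = hits / (hits + misses), accumulated over all (image, category) pairs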
5. Evaluating Visualizations
- interpretability vs. faithfulness tradeoff
5.1 Class Discrimination
- Dataset : PASCAL VOC 2007 - images with 2 annotated categories
- CNN model : VGG-16, AlexNet
- Method(Human Acc) : Deconv(53.33%), Guided backprop(44.44%), Deconv Grad-CAM(60.37%), Guided Grad-CAM(61.23%)
5.2 Trust
- CNN model : VGG-16, AlexNet <- both models making same prediction as GT
- Method: Guided backprop, Guided Grad-CAM
- Evaluation : rating reliability of models relative to each other
- Result : Guided backprop (VGG-16 : 1.00), Guided Grad-CAM (VGG-16 : 1.27) => VGG is more reliable than AlexNet
- Grad-CAM explanations help users place trust in the model that generalizes better, based only on individual prediction explanations
5.3 Faithfulness vs Interpretability
- Trade-off : More faithful, Less interpretable and vice versa
- Grad-CAM are reasonably interpretable, so evaluate how faithful!
- Faithfulness : ability to accurately explain function
- Reference : image occlusion maps, which are highly locally-faithful explanations -> measure rank correlation with them (a sketch of the occlusion map follows this list)
- Result : Grad-CAM correlates more strongly with the occlusion maps -> more faithful to the original model than prior methods
- Grad-CAM is both more faithful and more interpretable
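A minimal sketch of how such an occlusion map can be built, assuming a Keras classifier and an input image img already in the model's preprocessing range (patch size and stride are illustrative, not the paper's exact values); faithfulness is then the rank correlation (e.g. scipy.stats.spearmanr) between this map and the upsampled Grad-CAM map:

import numpy as np

def occlusion_map(model, img, class_index, patch=45, stride=8):
    H, W = img.shape[:2]
    base = model.predict(img[np.newaxis], verbose=0)[0, class_index]
    heat = np.zeros((H, W), dtype=np.float32)
    count = np.zeros((H, W), dtype=np.float32)
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            occluded = img.copy()
            occluded[y:y + patch, x:x + patch] = 0          # blank patch (~mean for mean-subtracted inputs)
            score = model.predict(occluded[np.newaxis], verbose=0)[0, class_index]
            heat[y:y + patch, x:x + patch] += base - score  # importance = score drop
            count[y:y + patch, x:x + patch] += 1
    return heat / np.maximum(count, 1)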
6. Diagnosing image classification CNNs with Grad-CAM
- VGG-16 pretrained on ImageNet
6.1 Analyzing failure modes of VGG-16
- Some failures are due to ambiguities inherent in ImageNet classification
- Guided Grad-CAM has reasonable explanations for failure predictions
6.2 Effect of adversarial noise on VGG-16
- Dataset : adversarial images for ImageNet-pretrained VGG-16
- Result : despite the network being almost certain about the absence of the categories actually present, Grad-CAM still localizes them correctly! -> fairly robust to adversarial noise
6.3 Identifying bias in dataset
- Task : binary classification of 'doctor' vs 'nurse'
- Biased model : misclassifies based on gender stereotypes (looks at face / hairstyle) => good validation acc, but poor generalization
- Less-biased model (trained on rebalanced data) : generalizes better (82% → 90%)
- Insight : Grad-CAM can help detect and reduce bias in training datasets -> better generalization, fairer and more ethical outcomes
7. Textual Explanations with Grad-CAM
- obtain concept names for neurons (channels) of the last conv layer -> sort neurons by importance and take the top-5 and bottom-5 -> use them for textual explanations (see the sketch after this list)
- higher positive values of neuron importance => presence of concept increases in class score
- important concepts are indicative of predicted class even for misclassification
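A minimal sketch of the ranking step, reusing the per-channel importances weights (a_k^c) returned by generate_gradcam below; mapping channel indices to human-readable concept names (e.g. via network dissection) is assumed to be available separately:

import numpy as np

# weights : neuron importances a_k^c for the predicted class, shape (K,)
order = np.argsort(weights)
top5_neurons = order[-5:][::-1]      # most positively important channels
bottom5_neurons = order[:5]          # most negatively important channels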
8. Grad-CAM for Image Captioning and VQA
- vision & language tasks
8.1 Image Captioning
- finetuned VGG-16 for images, LSTM-based language model (no explicit attention mechanism)
- compute gradient of log probability wrt units in last conv layer -> generate Grad-CAM visualizations
- FCLN produces bboxes for regions of interest & an LSTM-based model generates associated captions
- DenseCap generates 5 captions per image with GT bboxes
- Then, Guided Grad-CAM localizes the corresponding regions well despite never being trained with bbox annotations
8.2 Visual Question Answering
- CNN for processing images & RNN language model for questions
- image and question are fused to predict answer
- Result : rank correlation of Grad-CAM with occlusion maps : 0.60 ± 0.038 -> high faithfulness
9. Conclusion
- Grad-CAM (Gradient-weighted Class Activation Mapping) : class-discriminative localization technique for making any CNN model more transparent by visual explanations
- Guided Grad-CAM : Both high resolution + class-discriminative -> interpretability + faithfulness
- AI should be able to reason about its beliefs and actions for humans to trust and use it!
Code Review
import numpy as np
from tensorflow.keras import backend as K

# Note: K.function / K.gradients need graph mode; in TF2, call
# tf.compat.v1.disable_eager_execution() first (or use the GradientTape version below).
def generate_gradcam(img_tensor, model, class_index, activation_layer):
    model_input = model.input
    # y_c : class score for class_index (the paper uses the pre-softmax score)
    y_c = model.output[0, class_index]
    # A_k : output feature maps of the chosen conv layer
    A_k = model.get_layer(activation_layer).output
    # Given the model input, compute the conv feature maps (A_k) and
    # the gradient of the class score y_c with respect to A_k
    get_output = K.function([model_input], [A_k, K.gradients(y_c, A_k)[0]])
    [conv_output, grad_val] = get_output([img_tensor])
    # Drop the batch dimension: (1, width, height, k) -> (width, height, k),
    # where width/height are the spatial dims of the A_k feature maps
    conv_output = conv_output[0]
    grad_val = grad_val[0]
    # Global average pooling: average the gradients over width/height (the 1/Z term)
    # to obtain the importance weights a^c_k
    weights = np.mean(grad_val, axis=(0, 1))
    # Weighted combination over k of the forward feature maps (conv_output)
    # with the class-specific weights a^c_k; start from a (width, height) map of zeros
    grad_cam = np.zeros(dtype=np.float32, shape=conv_output.shape[0:2])
    for k, w in enumerate(weights):
        grad_cam += w * conv_output[:, :, k]
    # Apply ReLU to keep only positive contributions
    grad_cam = np.maximum(grad_cam, 0)
    return grad_cam, weights
import tensorflow as tf

def make_gradcam_heatmap(img_array, model, last_conv_layer_name, pred_index=None):
# First, we create a model that maps the input image to the activations
# of the last conv layer as well as the output predictions
grad_model = tf.keras.models.Model(
[model.inputs], [model.get_layer(last_conv_layer_name).output, model.output]
)
# Then, we compute the gradient of the top predicted class for our input image
# with respect to the activations of the last conv layer
with tf.GradientTape() as tape:
last_conv_layer_output, preds = grad_model(img_array)
if pred_index is None:
pred_index = tf.argmax(preds[0])
class_channel = preds[:, pred_index]
# This is the gradient of the output neuron (top predicted or chosen)
# with regard to the output feature map of the last conv layer
grads = tape.gradient(class_channel, last_conv_layer_output)
# This is a vector where each entry is the mean intensity of the gradient
# over a specific feature map channel
pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))
# We multiply each channel in the feature map array
# by "how important this channel is" with regard to the top predicted class
# then sum all the channels to obtain the heatmap class activation
last_conv_layer_output = last_conv_layer_output[0]
heatmap = last_conv_layer_output @ pooled_grads[..., tf.newaxis]
heatmap = tf.squeeze(heatmap)
# For visualization purpose, we will also normalize the heatmap between 0 & 1
heatmap = tf.maximum(heatmap, 0) / tf.math.reduce_max(heatmap)
return heatmap.numpy()
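A hypothetical usage sketch for make_gradcam_heatmap (the VGG-16 weights, the layer name 'block5_conv3', and the image file name are assumptions, not from the original post):

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

model = VGG16(weights="imagenet")
model.layers[-1].activation = None   # use the pre-softmax score as y^c

img = load_img("cat_dog.jpg", target_size=(224, 224))   # hypothetical image file
img_array = preprocess_input(np.expand_dims(img_to_array(img), axis=0))

heatmap = make_gradcam_heatmap(img_array, model, "block5_conv3")
print(heatmap.shape)   # (14, 14): coarse map at the resolution of the last conv layer

To visualize, resize the heatmap to the input resolution (e.g. with cv2.resize), apply a jet colormap, and alpha-blend it with the original image.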