[CV_Localization] Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Abstract
- visual explanations for decisions from CNN-based models -> more transparent and explainable
- uses gradients flowing into the final conv layer to produce a coarse localization map -> highlights regions in the image that are important for predicting the concept
- applicable to many CNN tasks without architectural changes or re-training
- Classification
- lends insight into failure modes via reasonable explanations
- outperform previous methods
- robust to adversarial perturbations
- more faithful to the underlying model
- helps identify dataset bias -> better generalization (for fair and bias-free outcomes)
- Localization
- image captioning, VQA
- even non-attention based models
- Human study
- appropriate trust in prediction from deep networks
- discern a stronger from a weaker model even when both make identical predictions
1. Introduction
- Transparent models : should explain why they predict what they predict
- AI evolution : (VQA) Identify failure / (classification) Establish appropriate trust / (chess) Teach human how to make better decisions
- Trade-off bw accuracy and interpretability (simplicity)
- Classical model : interpretability ↑, accuracy ↓
- Deep model : interpretability ↓, accuracy ↑ BY greater abstraction (layers↑) and integration (end-to-end training)
- CAM vs Grad-CAM
- CAM : constrained to model architecture (GAP -> fc)
- Grad-CAM : deep models without altering architecture (no trade-off) => Generalization of CAM
- Guided Grad-CAM : class-discriminative & high-resolution = good visual explanation
- CAM, Grad-CAM : class-discriminative (localize)
- Guided backprop, Deconv : high-resolution (detail)
2. Related Work
- Visualizing CNNs
- Assessing model trust
- Aligning gradient-based importance
- Weakly-supervised localization : training without bbox information
3. Grad-CAM
- last conv layer : high-level semantics (class-specific) & detailed spatial information
- gradient flowing into last conv -> assign importance values to each neuron for a particular decision of interest
① Class score (before softmax) : y^c (could be any differentiable activation)
② Gradients of y^c wrt feature map activations A^k via backprop : ∂y^c/∂A^k
③ Global average pooling -> Importance weight of feature map k for target class c : a_k^c
④ Weighted combination of forward activation maps
⑤ Apply ReLU b/c only interested in features of positive influence
-> result : coarse heatmap of same size as conv feature maps
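Putting ②-⑤ together, the two equations from the paper (Z = number of spatial locations in the feature map):
a_k^c = (1/Z) * Σ_i Σ_j ∂y^c / ∂A_ij^k (global-average-pooled gradients)
L^c_Grad-CAM = ReLU( Σ_k a_k^c * A^k ) (weighted combination of feature maps, then ReLU)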
3.1 Grad-CAM generalizes CAM
- Mathematical proof : for architectures ending in GAP -> fc (as CAM requires), the Grad-CAM weights a_k^c reduce, up to a proportionality constant, to the CAM class weights w_k^c, so CAM is a special case of Grad-CAM
3.2 Guided Grad-CAM
- Grad-CAM : pixel-space detail ↓ -> unclear why network predicts particular instance
- Guided Backprop : suppresses negative gradients when backpropagating through ReLUs -> captures pixel-level details detected by neurons
- Guided Grad-CAM : combine by element-wise multiplication (after upsampling Grad-CAM to the input resolution; see the sketch below) -> both high-resolution & class-discriminative + less noisy than deconv
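A minimal sketch of that fusion, assuming a guided-backprop saliency map gb_map of shape (H, W, 3) and a coarse Grad-CAM map grad_cam (e.g. from generate_gradcam in the Code Review below); the only extra step is upsampling before the element-wise product:

import cv2
import numpy as np

def guided_gradcam(gb_map, grad_cam):
    # Upsample the coarse Grad-CAM map to the input resolution (bilinear)
    cam = cv2.resize(grad_cam, (gb_map.shape[1], gb_map.shape[0]))
    cam = np.maximum(cam, 0) / (cam.max() + 1e-8)      # normalize to [0, 1]
    # Element-wise product: class-discriminative mask x high-resolution detail
    return gb_map * cam[..., np.newaxis]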
3.3 Counterfactual Explanations
- Which regions hurt the prediction the most? -> negate the gradients of y^c wrt A^k to highlight them (sketch below)
- Shows the DL model bases its decision on the foreground, not the background
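A minimal sketch of the counterfactual variant, reusing the conv_output / grad_val arrays computed inside generate_gradcam below; the only change from Grad-CAM is the minus sign on the gradients:

import numpy as np

def counterfactual_cam(conv_output, grad_val):
    # conv_output: A^k feature maps, shape (H, W, K); grad_val: dy^c/dA^k, same shape
    weights = np.mean(-grad_val, axis=(0, 1))      # negated importance weights
    cam = np.maximum(conv_output @ weights, 0)     # weighted sum over k + ReLU
    return cam                                     # highlights regions that lower y^c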
4. Evaluating Localization Ability of Grad-CAM
4.1 Weakly-supervised Localization
- Weakly-supervised Localization : training without bbox information
- Given an image -> obtain class predictions -> generate Grad-CAM maps for each predicted class -> binarize pixels with a threshold of 15% of max intensity -> draw a bbox around the single largest segment (see the sketch after this list)
- Grad-CAM localization error < others
- No change to the model structure and no re-training -> no compromise on classification performance!
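A minimal sketch of that thresholding pipeline, assuming cam is a Grad-CAM map already resized to the image resolution (the connected-component step uses scipy.ndimage; the 0.15 threshold follows the paper):

import numpy as np
from scipy import ndimage

def gradcam_to_bbox(cam, thresh_ratio=0.15):
    # Binarize at 15% of the max intensity
    mask = cam >= thresh_ratio * cam.max()
    # Keep only the single largest connected segment
    labeled, n = ndimage.label(mask)
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labeled, index=range(1, n + 1))
    largest = labeled == (np.argmax(sizes) + 1)
    # Tight bounding box around that segment: (x_min, y_min, x_max, y_max)
    ys, xs = np.where(largest)
    return xs.min(), ys.min(), xs.max(), ys.max()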
4.2 Weakly-supervised Segmentation
- Semantic Segmentation : assign each pixel in image an object class -> expensive pixel-level annotation
- Weakly-supervised Segmentation : segment object with image-level annotation -> cheap and easy to get data
- SEC with CAM : sensitive to the choice of the weak localization seed -> seeding SEC with Grad-CAM improves IoU from 44.6 to 49.6
4.3 Pointing Game
- Why : To evaluate discriminativeness of visualization method for localizing objects
- How : Extract the maximally activated point on the generated heatmap -> check whether it falls on the target-label object -> count Hits and Misses (see the sketch after this list)
Acc = # Hits / (# Hits + # Misses) ... this only measures precision
- For recall, compute localization maps for the top-5 class predictions -> evaluate them with an additional option
option : reject heatmaps below a threshold for classes absent from the GT
- Result : Grad-CAM > c-MWP (70.58% > 60.30%)
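A minimal sketch of one hit/miss decision, assuming cam is the heatmap at image resolution and gt_mask is a boolean ground-truth mask for the target category:

import numpy as np

def pointing_game_hit(cam, gt_mask):
    # Maximally activated point on the heatmap
    y, x = np.unravel_index(np.argmax(cam), cam.shape)
    # Hit if that point falls inside the ground-truth object region
    return bool(gt_mask[y, x])

# Acc = hits / (hits + misses), accumulated over all (image, category) pairs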
5. Evaluating Visualizations
- interpretability vs. faithfulness tradeoff
5.1 Class Discrimination
- Dataset : PASCAL VOC 2007 - images with 2 annotated categories
- CNN model : VGG-16, AlexNet
- Method(Human Acc) : Deconv(53.33%), Guided backprop(44.44%), Deconv Grad-CAM(60.37%), Guided Grad-CAM(61.23%)
5.2 Trust
- CNN model : VGG-16, AlexNet <- both models making same prediction as GT
- Method: Guided backprop, Guided Grad-CAM
- Evaluation : rating reliability of models relative to each other
- Result : Guided backprop (VGG-16 : 1.00), Guided Grad-CAM (VGG-16 : 1.27) => VGG is more reliable than AlexNet
- Grad-CAM explanations help users place trust in the model that generalizes better, based only on individual prediction explanations
5.3 Faithfulness vs Interpretability
- Trade-off : More faithful, Less interpretable and vice versa
- Grad-CAM are reasonably interpretable, so evaluate how faithful!
- Faithfulness : ability to accurately explain function
- Reference : image occlusion maps, which are highly locally-faithful explanations -> measure rank correlation with them (a sketch of the occlusion map follows this list)
- Result : Grad-CAM correlates more strongly with the occlusion maps -> more faithful to the original model than prior methods
- Grad-CAM is both more faithful and more interpretable
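A minimal sketch of how such an occlusion map can be built, assuming a Keras classifier and an input image img already in the model's preprocessing range (patch size and stride are illustrative, not the paper's exact values); faithfulness is then the rank correlation (e.g. scipy.stats.spearmanr) between this map and the upsampled Grad-CAM map:

import numpy as np

def occlusion_map(model, img, class_index, patch=45, stride=8):
    H, W = img.shape[:2]
    base = model.predict(img[np.newaxis], verbose=0)[0, class_index]
    heat = np.zeros((H, W), dtype=np.float32)
    count = np.zeros((H, W), dtype=np.float32)
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            occluded = img.copy()
            occluded[y:y + patch, x:x + patch] = 0          # blank patch (~mean for mean-subtracted inputs)
            score = model.predict(occluded[np.newaxis], verbose=0)[0, class_index]
            heat[y:y + patch, x:x + patch] += base - score  # importance = score drop
            count[y:y + patch, x:x + patch] += 1
    return heat / np.maximum(count, 1)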
6. Diagnosing image classification CNNs with Grad-CAM
- VGG-16 pretrained on ImageNet
6.1 Analyzing failure modes of VGG-16
- Some failures are due to ambiguities inherent in ImageNet classification
- Guided Grad-CAM has reasonable explanations for failure predictions
6.2 Effect of adversarial noise on VGG-16
- Dataset : adversarial images for ImageNet-pretrained VGG-16
- Result : despite the network being almost certain about the absence of the categories actually present, Grad-CAM still localizes them correctly! -> fairly robust to adversarial noise
6.3 Identifying bias in dataset
- Task : binary classification of 'doctor' vs 'nurse'
- Biased model : misclassifies based on gender stereotypes (looks at face / hairstyle) => good validation acc, but poor generalization
- Less-biased model (trained on rebalanced data) : generalizes better (82% → 90%)
- Insight : Grad-CAM can help detect and reduce bias in training datasets -> better generalization, fairer and more ethical outcomes
7. Textual Explanations with Grad-CAM
- obtain concept names for neurons (channels) of the last conv layer -> sort neurons by importance and take the top-5 and bottom-5 -> use them for textual explanations (see the sketch after this list)
- higher positive values of neuron importance => presence of concept increases in class score
- important concepts are indicative of predicted class even for misclassification
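A minimal sketch of the ranking step, reusing the per-channel importances weights (a_k^c) returned by generate_gradcam below; mapping channel indices to human-readable concept names (e.g. via network dissection) is assumed to be available separately:

import numpy as np

# weights : neuron importances a_k^c for the predicted class, shape (K,)
order = np.argsort(weights)
top5_neurons = order[-5:][::-1]      # most positively important channels
bottom5_neurons = order[:5]          # most negatively important channels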
8. Grad-CAM for Image Captioning and VQA
- vision & language tasks
8.1 Image Captioning
- finetuned VGG-16 for images, LSTM-based language model (no explicit attention mechanism)
- compute gradient of log probability wrt units in last conv layer -> generate Grad-CAM visualizations
- FCLN produces bboxes for regions of interest & an LSTM-based model generates associated captions
- DenseCap generates 5 captions per image with GT bboxes
- Then, Guided Grad-CAM localizes the corresponding regions well despite never being trained with bbox annotations
8.2 Visual Question Answering
- CNN for processing images & RNN language model for questions
- image and question are fused to predict answer
- Result : rank correlation of Grad-CAM with occlusion maps : 0.60 ± 0.038 -> high faithfulness
9. Conclusion
- Grad-CAM (Gradient-weighted Class Activation Mapping) : class-discriminative localization technique for making any CNN model more transparent by visual explanations
- Guided Grad-CAM : Both high resolution + class-discriminative -> interpretability + faithfulness
- AI should be able to reason about its beliefs and actions for humans to trust and use it!
Code Review
import numpy as np
from tensorflow.keras import backend as K

# Note: K.function / K.gradients need graph mode; in TF2, call
# tf.compat.v1.disable_eager_execution() first (or use the GradientTape version below).
def generate_gradcam(img_tensor, model, class_index, activation_layer):
    model_input = model.input
    # y_c : class score for class_index (the paper uses the pre-softmax score)
    y_c = model.output[0, class_index]
    # A_k : output feature maps of the chosen conv layer
    A_k = model.get_layer(activation_layer).output
    # Given the model input, compute the conv feature maps (A_k) and
    # the gradient of the class score y_c with respect to A_k
    get_output = K.function([model_input], [A_k, K.gradients(y_c, A_k)[0]])
    [conv_output, grad_val] = get_output([img_tensor])
    # Drop the batch dimension: (1, width, height, k) -> (width, height, k),
    # where width/height are the spatial dims of the A_k feature maps
    conv_output = conv_output[0]
    grad_val = grad_val[0]
    # Global average pooling: average the gradients over width/height (the 1/Z term)
    # to obtain the importance weights a^c_k
    weights = np.mean(grad_val, axis=(0, 1))
    # Weighted combination over k of the forward feature maps (conv_output)
    # with the class-specific weights a^c_k; start from a (width, height) map of zeros
    grad_cam = np.zeros(dtype=np.float32, shape=conv_output.shape[0:2])
    for k, w in enumerate(weights):
        grad_cam += w * conv_output[:, :, k]
    # Apply ReLU to keep only positive contributions
    grad_cam = np.maximum(grad_cam, 0)
    return grad_cam, weights
import tensorflow as tf

def make_gradcam_heatmap(img_array, model, last_conv_layer_name, pred_index=None):
# First, we create a model that maps the input image to the activations
# of the last conv layer as well as the output predictions
grad_model = tf.keras.models.Model(
[model.inputs], [model.get_layer(last_conv_layer_name).output, model.output]
)
# Then, we compute the gradient of the top predicted class for our input image
# with respect to the activations of the last conv layer
with tf.GradientTape() as tape:
last_conv_layer_output, preds = grad_model(img_array)
if pred_index is None:
pred_index = tf.argmax(preds[0])
class_channel = preds[:, pred_index]
# This is the gradient of the output neuron (top predicted or chosen)
# with regard to the output feature map of the last conv layer
grads = tape.gradient(class_channel, last_conv_layer_output)
# This is a vector where each entry is the mean intensity of the gradient
# over a specific feature map channel
pooled_grads = tf.reduce_mean(grads, axis=(0, 1, 2))
# We multiply each channel in the feature map array
# by "how important this channel is" with regard to the top predicted class
# then sum all the channels to obtain the heatmap class activation
last_conv_layer_output = last_conv_layer_output[0]
heatmap = last_conv_layer_output @ pooled_grads[..., tf.newaxis]
heatmap = tf.squeeze(heatmap)
# For visualization purpose, we will also normalize the heatmap between 0 & 1
heatmap = tf.maximum(heatmap, 0) / tf.math.reduce_max(heatmap)
return heatmap.numpy()
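A hypothetical usage sketch for make_gradcam_heatmap (the VGG-16 weights, the layer name 'block5_conv3', and the image file name are assumptions, not from the original post):

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

model = VGG16(weights="imagenet")
model.layers[-1].activation = None   # use the pre-softmax score as y^c

img = load_img("cat_dog.jpg", target_size=(224, 224))   # hypothetical image file
img_array = preprocess_input(np.expand_dims(img_to_array(img), axis=0))

heatmap = make_gradcam_heatmap(img_array, model, "block5_conv3")
print(heatmap.shape)   # (14, 14): coarse map at the resolution of the last conv layer

To visualize, resize the heatmap to the input resolution (e.g. with cv2.resize), apply a jet colormap, and alpha-blend it with the original image.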