Gradient-weighted Class Activation Mapping (Grad-CAM)

Deep neural networks have enabled exceptional breakthroughs in a variety of tasks. However, their lack of explanatory power makes them hard to interpret, and when they fail they leave users wondering why. Building explainable models is therefore crucial for establishing appropriate trust and confidence among users. Explanatory power also lets researchers direct their effort towards the root cause of a failure. Gradient-weighted Class Activation Mapping (Grad-CAM) is a class-discriminative localization technique that makes any convolutional neural network model more transparent by producing visual explanations (Selvaraju et al., 2017).

Overview

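Grad-CAM uses the gradient information flowing into the last convolutional layer of the model to obtain a localization map and understand the importance of each pixel of the input image for a specific class. Let $L^c_{\text{Grad-CAM}} \in \mathbb{R}^{u \times v}$ represent the localization map with width $u$ and height $v$ for class $c$. To calculate it, the gradient of the score for class $c$ (before the softmax), $y^c$, with respect to feature map $k$, $A^k$, of the last convolutional layer is computed and global average pooled over the spatial positions $(i, j)$ to obtain the neuron importance weight $\alpha_k^c$:

$$\alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}$$

where $Z = u \times v$ is the number of spatial positions in the feature map.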
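The pooling step can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's code: it assumes the gradients of the class score with respect to the feature maps have already been computed by an autodiff framework and are stored as an array of shape `(K, height, width)`. The function name `neuron_importance_weights` and the toy numbers are hypothetical.

```python
import numpy as np

def neuron_importance_weights(grads):
    """Global-average-pool the gradients over the two spatial axes.

    grads: array of shape (K, height, width) holding d y^c / d A^k_ij
           for each of the K feature maps (assumed to be precomputed).
    Returns an array of shape (K,) with one weight per feature map.
    """
    grads = np.asarray(grads, dtype=float)
    # The mean over axes 1 and 2 is the (1/Z) * sum_i sum_j term.
    return grads.mean(axis=(1, 2))

# Toy gradients for K = 2 feature maps of size 2 x 2 (made-up values).
toy_grads = np.array([[[1.0, 3.0], [5.0, 7.0]],
                      [[-2.0, 0.0], [0.0, -2.0]]])
alpha = neuron_importance_weights(toy_grads)
print(alpha)
```

A feature map whose gradients are mostly negative receives a negative weight, meaning that increasing its activations would lower the score of the class of interest.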

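Furthermore, a weighted combination of the forward activation maps $A^k$, using the weights $\alpha_k^c$, followed by a ReLU is obtained:

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$$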
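As a rough sketch of this step, the weighted sum over feature maps and the ReLU can be written with NumPy as follows. It assumes the activations of the last convolutional layer are available as an array of shape `(K, height, width)` and that one importance weight per feature map has already been computed. The function name `grad_cam_map` and the toy values are illustrative only.

```python
import numpy as np

def grad_cam_map(activations, weights):
    """Combine feature maps with their importance weights, then apply ReLU.

    activations: array of shape (K, height, width), the forward feature maps.
    weights:     array of shape (K,), one importance weight per feature map.
    Returns a coarse heatmap of shape (height, width).
    """
    activations = np.asarray(activations, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Contract the K axis: sum_k weights[k] * activations[k].
    combined = np.tensordot(weights, activations, axes=1)
    # ReLU keeps only locations with a positive influence on the class.
    return np.maximum(combined, 0.0)

# Toy example with K = 2 feature maps of size 2 x 2 (made-up values).
toy_activations = np.array([[[1.0, 0.0], [0.0, 2.0]],
                            [[0.0, 3.0], [1.0, 1.0]]])
toy_weights = np.array([4.0, -1.0])
heatmap = grad_cam_map(toy_activations, toy_weights)
print(heatmap)
```

In this toy case the weighted sum is negative at two locations, and the ReLU sets both to zero so they do not appear in the heatmap.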

This results in a coarse heatmap of the same size as the convolutional feature maps. The ReLU keeps only the features that have a positive influence on the class of interest, i.e. pixels whose intensity should be increased in order to increase the class score $y^c$. Negative pixels are likely to belong to other categories in the image. Without the ReLU, localization maps sometimes highlight more than just the desired class and achieve lower localization performance.
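Because the heatmap has the resolution of the feature maps rather than of the input, it is usually rescaled before being overlaid on the image. The sketch below is one simple way to do that, assuming an integer scale factor and using nearest-neighbour repetition in place of the smoother interpolation a plotting or image library would provide. The function name `upsample_heatmap` and the values are hypothetical and not taken from the notebook.

```python
import numpy as np

def upsample_heatmap(heatmap, scale):
    """Normalise a coarse heatmap to [0, 1] and enlarge it by an integer factor.

    heatmap: non-negative array of shape (height, width).
    scale:   integer factor by which each spatial axis is enlarged.
    """
    heatmap = np.asarray(heatmap, dtype=float)
    peak = heatmap.max()
    if peak > 0:
        # Avoid dividing by zero when the map is empty.
        heatmap = heatmap / peak
    # Nearest-neighbour upsampling: repeat every row and every column.
    return np.repeat(np.repeat(heatmap, scale, axis=0), scale, axis=1)

coarse = np.array([[4.0, 0.0], [0.0, 7.0]])
overlay = upsample_heatmap(coarse, scale=2)
print(overlay.shape)
```

The result can then be drawn semi-transparently over the input image. Which interpolation method to use is a design choice left open here.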

Results

The images below show Grad-CAM visualizations for two samples from CIFAR-10. The first belongs to the category “deer” and the second to the category “ship”:

From the Grad-CAM visualizations, it can be concluded that the trained model looks for horns to identify a deer and for an exhaust pipe or mast to detect a ship. This explains why the model failed to identify the right category in the following cases:

Reference:

Selvaraju, R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017). Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. IEEE International Conference on Computer Vision (ICCV) (https://arxiv.org/abs/1610.02391)