[CV_Localization] Learning Deep Features for Discriminative Localization
Abstract
- Global average pooling (GAP) : previously used only for regularizing training -> (in CAM) yields a generic, localizable deep representation
1. Introduction
- CNN : strong for classification and object detection, but the fc layers (flatten) -> ability to localize objects is lost
- Fully-convolutional networks (NIN), GoogLeNet : GAP as a structural regularizer -> minimize # of params + maintain high performance
- CAM : GAP also provides remarkable localization ability that is retained up to the final layer (deep features)
1.1 Related Work
Related work : localizing objects + identifying which regions of an image are being used for discrimination
(1) Weakly-supervised object localization
- Previous works : self-taught learning, multiple-instance learning, transferring mid-level image representations, multiple overlapping patches
-> No end-to-end training & multiple forward passes per image -> difficult to scale to real-world datasets
- GMP (Global Max Pooling) : localization limited to a point on the boundary of the object rather than its full extent
- CAM : End-to-end training & Single forward pass & GAP (full extent, all discriminative regions)
(2) Visualizing CNNs
- Previous works : Deconvnet (visualizing the patterns that activate each unit) -> incomplete (only analyzes conv layers, ignores fc layers)
- CAM : Removing fc layers -> able to understand whole network (end-to-end)
- Previous works : inverting deep features at different layers (including fc layers) -> but do not highlight the relative importance of image regions
- CAM : Highlight which regions are important for discrimination
2. Class Activation Mapping
- Class Activation Map for a particular category indicates the discriminative regions the CNN uses to identify that category
- Class Activation Mapping : CNN -> GAP on the last conv layer's feature maps -> fc (softmax) layer -> final output
- GAP : spatial average of each feature map of the last conv layer -> one value per channel (total : N values for N channels)
- CAM : weighted sum of the N feature maps using the N class weights -> one heat map per class
- Result : projecting the weights of the output layer back onto the conv feature maps -> identifies the importance of image regions
- f_k(x,y) : activation map (feature map) of unit k in last conv layer at spatial location (x,y)
- F_k(x,y) : result of GAP
- S_c : input to softmax for class c
- w_k^c : weight for class c -> importance of F_k for class c
- M_c(x,y) : CAM for class c -> importance of activation at (x,y) leading to classification of image to class c
- P_c : output of softmax for class c
- CAM = weighted linear sum of the presence of visual patterns at different spatial locations -> upsampling the CAM to the size of the input image identifies the regions most relevant to the class
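These definitions combine as follows (the paper's relations, written in the same notation):
  F_k = Σ_{x,y} f_k(x, y)
  S_c = Σ_k w_k^c · F_k = Σ_{x,y} Σ_k w_k^c · f_k(x, y)
  M_c(x, y) = Σ_k w_k^c · f_k(x, y), so S_c = Σ_{x,y} M_c(x, y)
  P_c = exp(S_c) / Σ_{c'} exp(S_{c'})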
Global average pooling (GAP) vs global max pooling (GMP)
- GAP : considers all discriminative parts of an object -> identifies the full extent of the object
- GMP : considers only the single most discriminative part of an object
- Classification performance : similar / Localization performance : GAP > GMP
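A minimal numpy sketch of the two pooling operations on a single hypothetical activation map (shape and values are arbitrary, for illustration only):

import numpy as np

# Hypothetical single-channel activation map f_k of shape (height, width)
f_k = np.random.rand(14, 14)

gap = f_k.mean()   # GAP: every spatial location contributes, so covering the full object extent raises the score
gmp = f_k.max()    # GMP: only the single strongest location matters, so one discriminative point is enough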
3. Weakly-supervised Object Localization
3.1 Setup
- Dataset : ILSVRC 2014
- CNN models : AlexNet, VGGnet, GoogLeNet (remove fc layers -> replace them with GAP)
- Localization ability improved when the last conv layer before GAP has a high spatial resolution (mapping resolution)
- So, remove some of the later layers -> add a new conv layer (3 x 3, stride 1, pad 1, 1024 units) followed by GAP (see the Keras sketch after this list)
- Networks were fine-tuned on 1.3M training images of ILSVRC
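A minimal Keras sketch of this modification, assuming a VGG16 backbone cut at conv5-3 and a 1000-class ILSVRC head (the backbone choice, input size, and layer names are illustrative, not prescribed by the paper):

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Backbone without its fc layers; stop at conv5-3 (block5_conv3) to keep a higher 14x14 mapping resolution
backbone = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
x = backbone.get_layer("block5_conv3").output

# New conv layer: 3x3, stride 1, pad 1 ("same"), 1024 units, followed by GAP and a softmax layer
x = layers.Conv2D(1024, kernel_size=3, strides=1, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1000, activation="softmax")(x)   # its weights w_k^c are later reused for CAM

model = models.Model(backbone.input, outputs)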
3.2 Results
(1) Classification
- GAP : Only small performance drop (1-2%) without fc layers -> Acceptable
(2) Localization
- bbox selection strategy : simple thresholding technique (keep regions above 20% of the CAM's max value -> bbox around the largest connected component; see the sketch after this list)
- [Table 2] GAP networks : not trained on a single annotated bbox, yet outperform the other weakly-supervised methods (NIN, Backprop)
- [Table 3] Weakly vs Fully-supervised methods
- bbox selection strategy (heuristics) : 2 bboxes (one tight and one loose) from the CAMs of the 1st and 2nd predicted classes + 1 loose bbox for the 3rd predicted class
- weakly-supervised GoogLeNet-GAP (heuristics) ~= fully-supervised AlexNet
- With the same architecture, weakly-supervised localization still has a long way to go to match fully-supervised performance
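A minimal sketch of the 20%-of-max thresholding heuristic referenced above, assuming the CAM has already been upsampled to image size (scipy's connected-component labeling is one possible implementation choice, not the paper's exact code):

import numpy as np
from scipy.ndimage import label

def cam_to_bbox(cam, threshold=0.2):
    # Keep regions whose activation is above 20% of the CAM's max value
    mask = cam >= threshold * cam.max()
    labeled, num_regions = label(mask)
    if num_regions == 0:
        return None
    # Take the bbox that covers the largest connected component
    sizes = [(labeled == i).sum() for i in range(1, num_regions + 1)]
    largest = int(np.argmax(sizes)) + 1
    ys, xs = np.where(labeled == largest)
    return xs.min(), ys.min(), xs.max(), ys.max()  # (x1, y1, x2, y2)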
4. Deep Features for Generic Localization
- Response from higher-level layers of CNN : effective generic features with SOTA on many image datasets
- Responses from a GAP CNN : also perform well as generic features + highlight discriminative regions (without any additional training for localization)
- GoogLeNet-GAP, GoogLeNet > AlexNet
- GoogLeNet-GAP ~= GoogLeNet
4.1 Fine-grained Recognition
- Dataset : CUB-200-2011 (200 bird species)
- Accuracy : GoogLeNet-GAP on the full image < on a CAM-based crop < on the GT bbox
4.2 Pattern Discovery
- To identify common elements or patterns such as text or high-level concepts
(1) Discovering informative objects in the scenes
- Dataset : 10 scene categories from SUN dataset
- Top 6 objects that most frequently overlap with the high-activation regions for two scene categories
(2) Concept localization in weakly labeled images
- Concept detector : localizes informative regions for short phrases (concepts), even when the phrases are more abstract than object names
(3) Weakly supervised text detector
- Dataset : 350 Google StreetView images containing text from SVT dataset
- highlight text without using bbox annotations
(4) Interpreting visual question answering (VQA)
- overall acc : 55.89%
- highlight image regions relevant to predicted answers
5. Visualizing Class-Specific Units
- Using GAP and the ranked softmax weight
- CAM : Visualize most discriminative units (Class-Specific Units) for a given class
- The combination of class-specific units guides the CNN to classify each image -> we can infer what the CNN actually learns (sketch below)
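A minimal sketch of ranking class-specific units by their softmax weights, assuming W is the GAP-to-softmax weight matrix of shape (num_units, num_classes), e.g. W = model.layers[-1].get_weights()[0] for the Keras model in the Code section below:

import numpy as np

def top_class_specific_units(W, class_index, top_k=5):
    # Units with the largest weight w_k^c contribute most to class c's score
    return np.argsort(W[:, class_index])[::-1][:top_k]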
6. Conclusion
- CAM enables classification-trained CNNs with GAP to perform object localization without bbox annotations
- CAM visualizes predicted class scores & highlights discriminative object parts
- CAM generalizes to other visual recognition tasks
Code
import numpy as np
from tensorflow.keras import backend as K

def generate_cam(img_tensor, model, class_index, last_conv):
    model_input = model.input
    # f_k(x, y) : output feature maps of the last conv layer
    f_k = model.get_layer(last_conv).output
    get_output = K.function([model_input], [f_k])
    [last_conv_output] = get_output([img_tensor])
    # The output includes the batch dimension, (1, height, width, k) -> reshape to (height, width, k)
    last_conv_output = last_conv_output[0]
    # From the weight matrix between the GAP layer and the softmax (dense) layer,
    # take the class weights w^c_k for class_index, e.g. for class 2 : w^2_1, w^2_2, w^2_3, ..., w^2_k
    class_weight_k = model.layers[-1].get_weights()[0][:, class_index]
    # Initialize the CAM with the (height, width) of the feature maps (last_conv_output)
    cam = np.zeros(dtype=np.float32, shape=last_conv_output.shape[0:2])
    # Weighted sum of the last conv layer's feature maps (last_conv_output) with class_weight_k (w^c_k)
    for k, w in enumerate(class_weight_k):
        cam += w * last_conv_output[:, :, k]
    return cam
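A hedged usage sketch for the function above (the image path, conv layer name, preprocessing, and normalization are assumptions for illustration; cv2 is used only to upsample the CAM to the input size, as described in Section 2):

import cv2
import numpy as np
from tensorflow.keras.preprocessing import image

# Assumed: `model` is a GAP-based CNN whose last layer is the softmax Dense layer,
# and "last_conv_name" is the (hypothetical) name of its last conv layer.
img = image.load_img("example.jpg", target_size=(224, 224))
img_tensor = np.expand_dims(image.img_to_array(img) / 255.0, axis=0)

class_index = int(np.argmax(model.predict(img_tensor)))        # predicted class
cam = generate_cam(img_tensor, model, class_index, "last_conv_name")
cam = cv2.resize(cam, (224, 224))                              # upsample CAM to the input size
cam = np.maximum(cam, 0) / (cam.max() + 1e-8)                  # normalize to [0, 1] for overlaying as a heat map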