[CV_Localization] Learning Deep Features for Discriminative Localization
Abstract
- Global average pooling (GAP) : previously used only for regularizing training -> (in CAM) yields a generic, localizable deep representation
1. Introduction
- CNN : strong for classification and object detection, but the fc layers (flatten) -> ability to localize objects is lost
- Fully-convolutional networks (NIN), GoogLeNet : GAP as a structural regularizer -> minimize # of params + maintain high performance
- CAM : GAP also provides remarkable localization ability that is retained up to the final layer (deep features)
1.1 Related Work
Related work : localizing objects + identifying which regions of an image are being used for discrimination
(1) Weakly-supervised object localization
- Previous works : self-taught learning, multiple-instance learning, transferring mid-level image representations, multiple overlapping patches
-> No end-to-end training & multiple forward passes per image -> difficult to scale to real-world datasets
- GMP (Global Max Pooling) : localization limited to a point on the boundary of the object rather than its full extent
- CAM : End-to-end training & Single forward pass & GAP (full extent, all discriminative regions)
(2) Visualizing CNNs
- Previous works : Deconvnet (visualizing the patterns that activate each unit) -> incomplete (only analyzes conv layers, ignores fc layers)
- CAM : Removing fc layers -> able to understand whole network (end-to-end)
- Previous works : inverting deep features at different layers (including fc layers) -> but do not highlight the relative importance of image regions
- CAM : Highlight which regions are important for discrimination
2. Class Activation Mapping
- Class Activation Map for a particular category indicates the discriminative regions the CNN uses to identify that category
- Class Activation Mapping : CNN -> GAP on the last conv layer's feature maps -> fc (softmax) layer -> final output
- GAP : spatial average of each feature map of the last conv layer -> one value per channel (total : N values for N channels)
- CAM : weighted sum of the N feature maps using the N class weights -> one heat map per class
- Result : projecting the weights of the output layer back onto the conv feature maps -> identifies the importance of image regions
- f_k(x,y) : activation map (feature map) of unit k in last conv layer at spatial location (x,y)
- F_k(x,y) : result of GAP
- S_c : input to softmax for class c
- w_k^c : weight for class c -> importance of F_k for class c
- M_c(x,y) : CAM for class c -> importance of activation at (x,y) leading to classification of image to class c
- P_c : output of softmax for class c
- CAM = weighted linear sum of the presence of visual patterns at different spatial locations -> upsampling the CAM to the size of the input image identifies the regions most relevant to the class
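These definitions combine as follows (the paper's relations, written in the same notation):
  F_k = Σ_{x,y} f_k(x, y)
  S_c = Σ_k w_k^c · F_k = Σ_{x,y} Σ_k w_k^c · f_k(x, y)
  M_c(x, y) = Σ_k w_k^c · f_k(x, y), so S_c = Σ_{x,y} M_c(x, y)
  P_c = exp(S_c) / Σ_{c'} exp(S_{c'})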
Global average pooling (GAP) vs global max pooling (GMP)
- GAP : considers all discriminative parts of an object -> identifies the full extent of the object
- GMP : considers only the single most discriminative part of an object
- Classification performance : similar / Localization performance : GAP > GMP
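A minimal numpy sketch of the two pooling operations on a single hypothetical activation map (shape and values are arbitrary, for illustration only):

import numpy as np

# Hypothetical single-channel activation map f_k of shape (height, width)
f_k = np.random.rand(14, 14)

gap = f_k.mean()   # GAP: every spatial location contributes, so covering the full object extent raises the score
gmp = f_k.max()    # GMP: only the single strongest location matters, so one discriminative point is enough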
3. Weakly-supervised Object Localization
3.1 Setup
- Dataset : ILSVRC 2014
- CNN models : AlexNet, VGGnet, GoogLeNet (remove fc layers -> replace them with GAP)
- Localization ability improved when the last conv layer before GAP has a high spatial resolution (mapping resolution)
- So, remove some of the later layers -> add a new conv layer (3 x 3, stride 1, pad 1, 1024 units) followed by GAP (see the Keras sketch after this list)
- Networks were fine-tuned on 1.3M training images of ILSVRC
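A minimal Keras sketch of this modification, assuming a VGG16 backbone cut at conv5-3 and a 1000-class ILSVRC head (the backbone choice, input size, and layer names are illustrative, not prescribed by the paper):

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Backbone without its fc layers; stop at conv5-3 (block5_conv3) to keep a higher 14x14 mapping resolution
backbone = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
x = backbone.get_layer("block5_conv3").output

# New conv layer: 3x3, stride 1, pad 1 ("same"), 1024 units, followed by GAP and a softmax layer
x = layers.Conv2D(1024, kernel_size=3, strides=1, padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(1000, activation="softmax")(x)   # its weights w_k^c are later reused for CAM

model = models.Model(backbone.input, outputs)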
3.2 Results
(1) Classification
- GAP : Only small performance drop (1-2%) without fc layers -> Acceptable
(2) Localization
- bbox selection strategy : simple thresholding technique (keep regions above 20% of the CAM's max value -> bbox around the largest connected component; see the sketch after this list)
- [Table 2] GAP networks : not trained on a single annotated bbox, yet outperform the other weakly-supervised methods (NIN, Backprop)
- [Table 3] Weakly vs Fully-supervised methods
- bbox selection strategy (heuristics) : 2 bboxes (one tight and one loose) from the CAMs of the 1st and 2nd predicted classes + 1 loose bbox for the 3rd predicted class
- weakly-supervised GoogLeNet-GAP (heuristics) ~= fully-supervised AlexNet
- With the same architecture, weakly-supervised localization still has a long way to go to match fully-supervised performance
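A minimal sketch of the 20%-of-max thresholding heuristic referenced above, assuming the CAM has already been upsampled to image size (scipy's connected-component labeling is one possible implementation choice, not the paper's exact code):

import numpy as np
from scipy.ndimage import label

def cam_to_bbox(cam, threshold=0.2):
    # Keep regions whose activation is above 20% of the CAM's max value
    mask = cam >= threshold * cam.max()
    labeled, num_regions = label(mask)
    if num_regions == 0:
        return None
    # Take the bbox that covers the largest connected component
    sizes = [(labeled == i).sum() for i in range(1, num_regions + 1)]
    largest = int(np.argmax(sizes)) + 1
    ys, xs = np.where(labeled == largest)
    return xs.min(), ys.min(), xs.max(), ys.max()  # (x1, y1, x2, y2)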
4. Deep Features for Generic Localization
- Response from higher-level layers of CNN : effective generic features with SOTA on many image datasets
- Responses from a GAP CNN : also perform well as generic features + highlight discriminative regions (without any additional training for localization)
- GoogLeNet-GAP, GoogLeNet > AlexNet
- GoogLeNet-GAP ~= GoogLeNet
4.1 Fine-grained Recognition
- Dataset : CUB-200-2011 (200 bird species)
- Accuracy : GoogLeNet-GAP on the full image < on a CAM-based crop < on the GT bbox
4.2 Pattern Discovery
- To identify common elements or patterns such as text or high-level concepts
(1) Discovering informative objects in the scenes
- Dataset : 10 scene categories from SUN dataset
- Top 6 objects that most frequently overlap with the high-activation regions for two scene categories
(2) Concept localization in weakly labeled images
- Concept detector : localizes informative regions for short phrases (concepts), even when the phrases are more abstract than object names
(3) Weakly supervised text detector
- Dataset : 350 Google StreetView images containing text from SVT dataset
- highlight text without using bbox annotations
(4) Interpreting visual question answering (VQA)
- overall acc : 55.89%
- highlight image regions relevant to predicted answers
5. Visualizing Class-Specific Units
- Using GAP and the ranked softmax weight
- CAM : Visualize most discriminative units (Class-Specific Units) for a given class
- The combination of class-specific units guides the CNN to classify each image -> we can infer what the CNN actually learns (sketch below)
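A minimal sketch of ranking class-specific units by their softmax weights, assuming W is the GAP-to-softmax weight matrix of shape (num_units, num_classes), e.g. W = model.layers[-1].get_weights()[0] for the Keras model in the Code section below:

import numpy as np

def top_class_specific_units(W, class_index, top_k=5):
    # Units with the largest weight w_k^c contribute most to class c's score
    return np.argsort(W[:, class_index])[::-1][:top_k]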
6. Conclusion
- CAM enables classification-trained CNNs with GAP to perform object localization without bbox annotations
- CAM visualizes predicted class scores & highlights discriminative object parts
- CAM generalizes to other visual recognition tasks
Code
import numpy as np
from tensorflow.keras import backend as K

def generate_cam(img_tensor, model, class_index, last_conv):
    model_input = model.input
    # f_k(x, y) : output feature maps of the last conv layer
    f_k = model.get_layer(last_conv).output
    get_output = K.function([model_input], [f_k])
    [last_conv_output] = get_output([img_tensor])
    # The output includes the batch dimension, (1, height, width, k) -> reshape to (height, width, k)
    last_conv_output = last_conv_output[0]
    # From the weight matrix between the GAP layer and the softmax (dense) layer,
    # take the class weights w^c_k for class_index, e.g. for class 2 : w^2_1, w^2_2, w^2_3, ..., w^2_k
    class_weight_k = model.layers[-1].get_weights()[0][:, class_index]
    # Initialize the CAM with the (height, width) of the feature maps (last_conv_output)
    cam = np.zeros(dtype=np.float32, shape=last_conv_output.shape[0:2])
    # Weighted sum of the last conv layer's feature maps (last_conv_output) with class_weight_k (w^c_k)
    for k, w in enumerate(class_weight_k):
        cam += w * last_conv_output[:, :, k]
    return cam
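A hedged usage sketch for the function above (the image path, conv layer name, preprocessing, and normalization are assumptions for illustration; cv2 is used only to upsample the CAM to the input size, as described in Section 2):

import cv2
import numpy as np
from tensorflow.keras.preprocessing import image

# Assumed: `model` is a GAP-based CNN whose last layer is the softmax Dense layer,
# and "last_conv_name" is the (hypothetical) name of its last conv layer.
img = image.load_img("example.jpg", target_size=(224, 224))
img_tensor = np.expand_dims(image.img_to_array(img) / 255.0, axis=0)

class_index = int(np.argmax(model.predict(img_tensor)))        # predicted class
cam = generate_cam(img_tensor, model, class_index, "last_conv_name")
cam = cv2.resize(cam, (224, 224))                              # upsample CAM to the input size
cam = np.maximum(cam, 0) / (cam.max() + 1e-8)                  # normalize to [0, 1] for overlaying as a heat map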