jeonggg119/DL_paper

[CV_Localization] Learning Deep Features for Discriminative Localization


Abstract

  • Global average pooling (GAP) : (previously) used as a regularizer during training -> (this paper) enables a generic localizable deep representation

1. Introduction

  • CNN : good at classification & object detection, but flattening into FC layers -> ability to localize objects is lost
  • Fully-convolutional networks (NIN, GoogLeNet) : GAP as regularizer -> minimize # of params + maintain high performance
  • CAM : with GAP, remarkable localization ability is retained up to the final layer (deep features)

1.1 Related Work

Localizing objects + identifying which regions of an image are used for discrimination

(1) Weakly-supervised object localization

  • Previous works : self-taught learning, multiple-instance learning, transferring mid-level image representations, multiple overlapping patches
    -> no end-to-end training & multiple forward passes -> difficult to scale to real-world datasets
  • GMP (Global Max Pooling) : localization limited to a point on the boundary of the object rather than its full extent
  • CAM : End-to-end training & Single forward pass & GAP (full extent, all discriminative regions)

(2) Visualizing CNNs

  • Previous works : Deconvnet (visualizing patterns that activate each unit) -> incomplete (only analyzes conv layers, ignores FC layers)
  • CAM : removing FC layers -> the whole network can be understood (end-to-end)
  • Previous works : inverting deep features at different layers (including FC layers) -> but does not highlight the relative importance of image regions
  • CAM : highlights which regions are important for discrimination

2. Class Activation Mapping

[Figure] CAM architecture : GAP on the last conv feature maps -> FC layer -> softmax

  • Class Activation Map for each particular category indicates the discriminative regions used to identify that category
  • Class Activation Mapping : CNN -> GAP on last conv layer (feature maps) -> FC layer -> softmax final output
    GAP : spatial average of each feature map of the last conv layer -> one value per channel (total : N values for N channels)
    CAM : weighted sum of the N feature maps with the N class-specific weights -> one heat map for each class
  • Result : projecting the output-layer weights back onto the conv feature maps -> can identify the importance of image regions
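
A minimal numpy sketch of the shapes involved (all sizes and array names here are illustrative, not from the paper):

import numpy as np

H, W, K, C = 14, 14, 1024, 1000       # spatial size, channels, classes (illustrative)
feats = np.random.rand(H, W, K)       # f_k(x, y) : last conv feature maps
weights = np.random.rand(K, C)        # w_k^c : GAP -> softmax weight matrix

F = feats.sum(axis=(0, 1))            # GAP (paper uses the spatial sum), shape (K,)
S = F @ weights                       # class scores S_c, shape (C,)
P = np.exp(S - S.max()) / np.exp(S - S.max()).sum()   # softmax output P_c (stable form)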

[Figure] Example class activation maps highlighting class-specific discriminative regions

$$F_k = \sum_{x,y} f_k(x,y)$$

$$S_c = \sum_k w_k^c F_k = \sum_{x,y} \sum_k w_k^c f_k(x,y)$$

$$M_c(x,y) = \sum_k w_k^c f_k(x,y) \quad\Rightarrow\quad S_c = \sum_{x,y} M_c(x,y)$$

$$P_c = \frac{\exp(S_c)}{\sum_{c'} \exp(S_{c'})}$$

  • f_k(x,y) : activation map (feature map) of unit k in last conv layer at spatial location (x,y)
  • F_k : result of GAP over f_k (a single value per channel k)
  • S_c : input to softmax for class c
  • w_k^c : weight for class c -> importance of F_k for class c
  • M_c(x,y) : CAM for class c -> importance of activation at (x,y) leading to classification of image to class c
    CAM = weighted linear sum of visual patterns at different spatial locations -> upsample CAM to the size of the input to identify the most relevant regions !
  • P_c : output of softmax for class c
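
A hedged numpy sketch of computing M_c and upsampling it to the input size (function and argument names are mine; reuses the feats/weights shapes from the sketch above):

import numpy as np
from scipy.ndimage import zoom

def cam_for_class(feats, weights, c, input_size=(224, 224)):
    # M_c(x, y) = sum_k w_k^c * f_k(x, y) : weighted sum over channels
    M_c = feats @ weights[:, c]                    # (H, W, K) @ (K,) -> (H, W)
    # upsample the coarse CAM to the input resolution
    scale = (input_size[0] / M_c.shape[0], input_size[1] / M_c.shape[1])
    return zoom(M_c, scale, order=1)               # order=1 : linear interpolation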

Global average pooling (GAP) vs global max pooling (GMP)

  • GAP : considers all discriminative parts of an object -> identifies the full extent of the object
  • GMP : considers only the single most discriminative part
  • Classification performance : similar / Localization performance : GAP > GMP
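
A toy numpy example of why GAP favors full extent while GMP does not (values are made up):

import numpy as np

fmap = np.zeros((4, 4))
fmap[0, 0], fmap[2, 2] = 1.0, 0.9     # two discriminative regions

# GMP's score (and its gradient) depends only on the single maximal location,
# so training only needs one strong peak; GAP averages every location,
# so all discriminative regions raise the score.
print(fmap.max())    # GMP : 1.0 -> ignores the 0.9 region entirely
print(fmap.mean())   # GAP : 0.11875 -> every active location contributes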

3. Weakly-supervised Object Localization

3.1 Setup

  • Dataset : ILSVRC 2014
  • CNN models : AlexNet, VGGnet, GoogLeNet (remove fc layers -> replace them with GAP)
    • Localization ability improves when the last conv layer before GAP has high spatial resolution (mapping resolution)
    • So, remove some layers -> add new conv layers (3 x 3, stride 1, pad 1, 1024 units) followed by GAP
  • Networks were fine-tuned on 1.3M training images of ILSVRC
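
A hedged Keras sketch of this modification (the `base` backbone and layer choices are illustrative; the 3x3 / 1024-unit conv follows the paper's description):

from tensorflow.keras import layers, models

def add_gap_head(base, num_classes=1000):
    # base : backbone truncated after a high-resolution conv feature map
    x = layers.Conv2D(1024, 3, strides=1, padding='same', activation='relu')(base.output)
    x = layers.GlobalAveragePooling2D()(x)             # GAP replaces the FC layers
    out = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(base.input, out)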

3.2 Results

(1) Classification

[Table 1] Classification error on the ILSVRC validation set

  • GAP : only a small performance drop (1-2%) without FC layers -> acceptable

(2) Localization

[Table 2] Localization error on the ILSVRC validation set
[Table 3] Localization error on the ILSVRC test set (weakly- vs fully-supervised)

  • bbox selection strategy : simple thresholding technique (segment regions above 20% of the max CAM value -> bbox around the largest connected component; see the sketch after this list)
  • [Table 2] GAP : not trained on a single annotated bbox, yet outperforms the others (NIN, Backprop)
  • [Table 3] Weakly- vs fully-supervised methods
    • bbox selection strategy (heuristics) : 2 bboxes (one tight and one loose) from the 1st and 2nd predicted classes + 1 loose bbox for the 3rd predicted class
    • weakly-supervised GoogLeNet-GAP (heuristics) ~= fully-supervised AlexNet
    • with the same model, weakly-supervised localization still has a long way to go...
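
A hedged sketch of the thresholding heuristic (20% of the max CAM value, bbox around the largest connected component; `scipy.ndimage.label` is my choice for segmentation):

import numpy as np
from scipy.ndimage import label

def cam_to_bbox(cam, threshold=0.2):
    # keep regions whose activation exceeds 20% of the CAM's max value
    mask = cam >= threshold * cam.max()
    # bbox of the largest connected component of the thresholded mask
    labeled, n = label(mask)
    sizes = [(labeled == i + 1).sum() for i in range(n)]
    ys, xs = np.where(labeled == 1 + int(np.argmax(sizes)))
    return xs.min(), ys.min(), xs.max(), ys.max()      # (x1, y1, x2, y2)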

4. Deep Features for Generic Localization

[Table] Classification accuracy with GAP features as generic features on various datasets

  • Responses from higher-level layers of a CNN : effective generic features with SOTA results on many image datasets
  • Responses from the GAP CNN : also perform well as generic features + highlight discriminative regions (without any extra training)
    • GoogLeNet-GAP, GoogLeNet > AlexNet
    • GoogLeNet-GAP ~= GoogLeNet
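
A hedged sketch of using the GAP output as a generic feature (layer name 'gap' and the data variables are hypothetical; a linear SVM on frozen features matches the paper's evaluation setup):

from tensorflow.keras import models
from sklearn.svm import LinearSVC

# take the GAP output of a trained GAP-CNN as the feature extractor
feat_model = models.Model(model.input, model.get_layer('gap').output)

X_train = feat_model.predict(train_images)     # (num_images, K) GAP features
clf = LinearSVC().fit(X_train, train_labels)   # linear SVM on frozen features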

4.1 Fine-grained Recognition

[Table] Accuracy on CUB-200-2011

  • Dataset : CUB-200-2011 (200 bird species)
  • GoogLeNet-GAP accuracy : full image < CAM-based crop < GT bbox
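
The crop setting can be approximated by cropping with a CAM-derived bbox and re-classifying; a sketch assuming `cam_to_bbox` from 3.2 and hypothetical `cam`, `image`, and `classify` objects:

# bbox from the CAM of the predicted class, then re-classify the crop
x1, y1, x2, y2 = cam_to_bbox(cam, threshold=0.2)
crop = image[y1:y2 + 1, x1:x2 + 1]
prediction = classify(crop)    # full image < this crop < GT bbox in accuracy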

4.2 Pattern Discovery

  • To identify common elements or patterns such as text or high-level concepts

(1) Discovering informative objects in the scenes

[Figure] Informative objects for two scene categories

  • Dataset : 10 scene categories from the SUN dataset
  • For two scene categories, list the top 6 objects that most frequently overlap with high-activation regions

(2) Concept localization in weakly labeled images

[Figure] Concept localization in weakly labeled images

  • Concept detector : localizes informative regions for concepts, even when the phrases are more abstract than object names

(3) Weakly supervised text detector

[Figure] Weakly supervised text detection heat maps

  • Dataset : 350 Google StreetView images containing text from the SVT dataset
  • Highlights text without using bbox annotations

(4) Interpreting visual question answering (VQA)

[Figure] VQA examples with highlighted image regions

  • Overall accuracy : 55.89%
  • Highlights image regions relevant to the predicted answers

5. Visualizing Class-Specific Units

[Figure] Top-ranked class-specific units for example classes

  • Using GAP and the ranked softmax weights
  • CAM : visualizes the most discriminative units (class-specific units) for a given class
  • The combination of class-specific units guides the CNN's prediction -> we can infer what the CNN actually learned !
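
A minimal sketch of ranking units by their softmax weight (names are mine; `weights` is the GAP -> softmax matrix from the earlier sketches):

import numpy as np

def top_units(weights, c, n=10):
    # units with the largest w_k^c are the most class-specific for class c
    return np.argsort(weights[:, c])[::-1][:n]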

6. Conclusion

  • CAM enables classification-trained CNNs with GAP to perform object localization without bbox annotations
  • CAM visualizes predicted class scores & highlights discriminative object parts
  • CAM generalizes to other visual recognition tasks

Code

import numpy as np
from tensorflow.keras import backend as K

def generate_cam(img_tensor, model, class_index, last_conv):

    model_input = model.input

    # f_k(x, y) : output feature maps of the last conv layer
    f_k = model.get_layer(last_conv).output
    get_output = K.function([model_input], [f_k])
    [last_conv_output] = get_output([img_tensor])

    # shape is (1, width, height, k) including the batch size -> reshape to (width, height, k)
    last_conv_output = last_conv_output[0]

    # class_weight_k (w^c_k) : the column for class_index in the weight matrix
    # between the GAP layer and the softmax(+ dense) layer
    # e.g. w^2_1, w^2_2, w^2_3, ..., w^2_k
    class_weight_k = model.layers[-1].get_weights()[0][:, class_index]

    # initialize the CAM to the (width, height) of the feature maps (last_conv_output)
    cam = np.zeros(dtype=np.float32, shape=last_conv_output.shape[0:2])

    # weighted sum of the last conv feature maps (last_conv_output) with class_weight_k (w^c_k)
    for k, w in enumerate(class_weight_k):
        cam += w * last_conv_output[:, :, k]

    return cam
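
A hedged usage sketch (preprocessing and the layer name 'conv5' are illustrative):

probs = model.predict(img_tensor)              # img_tensor : (1, H, W, C)
cam = generate_cam(img_tensor, model, int(np.argmax(probs)), last_conv='conv5')
cam = np.maximum(cam, 0) / cam.max()           # normalize before upsampling/overlay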