jeonggg119/DL_paper

[CV_Segmentation] Multi-scale context aggregation by dilated convolutions



1. INTRODUCTION

  • Semantic segmentation requires combining pixel-level accuracy with multi-scale contextual reasoning
  • Structural differences between image classification and dense prediction

dense prediction : predicting a label for every pixel of the image

  • Repurposed classification networks : which components are truly necessary for dense prediction, and which reduce accuracy when operated densely?
  • Modern classification networks
    • Integrating multi-scale contextual information via successive pooling and subsampling → reduce resolution
    • BUT dense prediction needs full-resolution output
  • Demand for both multi-scale reasoning and full-resolution output
    • repeated up-convolutions : recover resolution lost to severe intermediate downsampling → is that downsampling necessary?
    • combining predictions over multiple rescaled inputs : analyzes separate copies of the input → is the separated analysis necessary?
  • Dilated convolutions : conv module designed for dense prediction (semantic segmentation)
    • multi-scale contextual information without losing resolution
    • plugged into existing architectures at any resolution
    • no pooling or subsampling
    • exponential expansion of receptive field without losing resolution or coverage
    • accuracy of sota semantic segmentation ↑

2. Dilated convolutions

[Figure: dilated convolutions with dilation factors 1, 2, 4; (a) F_1, (b) F_2, (c) F_3]

  • Dilated convolution (*_l) applies the same filter at different ranges by varying the dilation factor l
  • F_(i+1) = F_i *_(2^i) k_i for i = 0, 1, ..., n-2
  • F_0, ..., F_(n-1) : discrete functions (feature maps), k_0, ..., k_(n-2) : discrete 3x3 filters
  • Receptive field of each element in F_(i+1) : [2^(i+2) - 1] x [2^(i+2) - 1], a square of exponentially increasing size
    • (a) F_1 : 3x3, (b) F_2 : 7x7, (c) F_3 : 15x15 receptive field
    • non-red positions in the figure carry zero weight : the number of nonzero filter taps stays 3x3 while the receptive field grows
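The exponential growth above can be checked numerically. A minimal NumPy sketch (the function names are mine, not from the paper): a 1-D dilated convolution samples the input `l` positions apart under each filter tap, and stacking 3-tap layers with dilations 2^i yields receptive fields 3, 7, 15, ... = 2^(i+2) - 1.

```python
import numpy as np

def dilated_conv1d(f, k, l):
    """1-D dilated convolution (f *_l k): tap t of the filter reads the
    input l*t positions away (only fully-covered output positions kept)."""
    taps = len(k)
    span = l * (taps - 1)                  # input span covered by the filter
    out = np.zeros(len(f) - span)
    for t in range(taps):
        out += k[t] * f[t * l : t * l + len(out)]
    return out

def receptive_fields(n_layers):
    """Receptive field after stacking 3-tap layers with dilation 2^i:
    r_{i+1} = r_i + 2 * 2^i, whose closed form is 2^(i+2) - 1."""
    r, sizes = 1, []
    for i in range(n_layers):
        r += 2 * 2**i
        sizes.append(r)
    return sizes

print(receptive_fields(3))   # [3, 7, 15], matching F_1, F_2, F_3
```

The recurrence adds 2 * dilation per layer because a 3-tap filter extends the field by one dilated step on each side.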

3. Multi-scale context aggregation


[ Context module ]

  • Input, Output
    • C feature maps → C feature maps : can maintain resolution
    • Same form : can be plugged into any dense prediction architecture
  • Each layer has C channels
    • directly obtain dense per-class prediction
    • feature maps are not normalized, no loss is defined
  • Multiple layers that expose contextual information → increase accuracy

[ Basic Context module ]

  • 7 layers : 3x3xC conv with different dilation factors (1,1,2,4,8,16,1)
  • A final layer : 1x1xC conv → produce output of the module
  • Front end module output feature map : 64x64 resolution → stop expansion after layer 6
  • Identity Initialization : set all filters such that each layer simply passes its input directly to the next
  • Result : increases dense prediction accuracy both quantitatively and qualitatively, with a small number of parameters (64C² total)
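Identity initialization and the 64C² parameter count can be sketched with a small NumPy dilated 2-D convolution (my own minimal implementation, not the paper's Caffe code; ReLUs omitted for brevity): with every layer's center tap set to the channel-wise identity, the stack of 7 dilated 3x3 layers initially passes its input through unchanged, and 7 layers of 9C² weights plus the final 1x1 layer's C² weights total 64C².

```python
import numpy as np

def dilated_conv2d(x, w, l):
    """x: (C_in, H, W); w: (C_out, C_in, 3, 3). Zero-padding by l keeps
    the output at the same H x W resolution (no pooling or subsampling)."""
    C_out, C_in, k, _ = w.shape
    H, W = x.shape[1:]
    xp = np.pad(x, ((0, 0), (l, l), (l, l)))
    y = np.zeros((C_out, H, W))
    for i in range(k):
        for j in range(k):
            y += np.einsum('oc,chw->ohw', w[:, :, i, j],
                           xp[:, i * l:i * l + H, j * l:j * l + W])
    return y

def identity_filters(C, k=3):
    """Identity initialization: only the center tap is nonzero, and it is
    the identity over channels, so the layer passes its input through."""
    w = np.zeros((C, C, k, k))
    w[np.arange(C), np.arange(C), k // 2, k // 2] = 1.0
    return w

C = 4
x = np.random.randn(C, 64, 64)
out = x
for l in (1, 1, 2, 4, 8, 16, 1):      # the 7 dilated 3x3 layers
    out = dilated_conv2d(out, identity_filters(C), l)
# parameters: seven 3x3 C-to-C layers + one 1x1 C-to-C output layer
params = 7 * 9 * C * C + C * C        # = 64 * C^2
```

At this initialization `out` equals `x`, so plugging the module into a trained front end cannot hurt its predictions before fine-tuning begins.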

4. Front End

[ Front End module ] : backbone that feeds the Context module

  • Input : reflection padded color image → Output : 64x64xC feature maps
  • remove the last 2 pooling and striding layers of VGG-16 → dilate subsequent convolution layers by a factor of 2 for each removed pooling layer
  • remove padding of intermediate feature maps
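The conversion rule can be expressed as a small bookkeeping sketch (the layer names and helper are mine, with the VGG-16 conv blocks collapsed to one name each): every removed stride-2 pooling layer doubles the dilation of all convolutions after it, so conv5 runs at dilation 2 and the converted fc6 at dilation 4 while the feature maps stay at higher resolution.

```python
def front_end_dilations(arch, removed_pools):
    """For each removed stride-2 pooling layer, double the dilation of all
    subsequent convolutions so their receptive fields are preserved."""
    dilation, out = 1, []
    for layer in arch:
        if layer in removed_pools:
            dilation *= 2     # pool removed; later convs must compensate
        elif layer.startswith("pool"):
            out.append((layer, None))       # kept pools have no dilation
        else:
            out.append((layer, dilation))
    return out

# simplified VGG-16 sequence, one entry per conv block
vgg = ["conv1", "pool1", "conv2", "pool2", "conv3", "pool3",
       "conv4", "pool4", "conv5", "pool5", "fc6"]
print(front_end_dilations(vgg, {"pool4", "pool5"}))
# conv1..conv4 keep dilation 1; conv5 -> 2; fc6 -> 4
```

With only three stride-2 stages left, a padded input yields the 64x64 feature maps the context module consumes.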


  • Training
    • Pascal VOC 2012 training set + subset of annotations of validation set
    • SGD, batch size = 14, lr = 10^-3, momentum = 0.9, iterations = 60K
  • Test result : front end is both simpler and more accurate

5. Experiments

  • Implementation : based on Caffe library
  • Dataset : Microsoft COCO with VOC-2012 categories
  • Training : 2 stage
    • 1st : VOC-2012 & COCO : SGD, batch size = 14, momentum = 0.9, iterations = 100K (lr = 10^-3) + 40K (lr = 10^-4)
    • 2nd : fine-tuned network on VOC-2012 only : iterations = 50K (lr = 10^-5)
  • Test result
    • Front-end module (alone) : 69.8% mean IoU on val set, 71.3% on test set
    • Attribution : accuracy gains come from removing vestigial components left over from image classification

(1) Controlled evaluation of context aggregation


  • context module and structured prediction are synergistic → increase accuracy in each configuration
  • large context module increases accuracy by a larger margin

(2) Evaluation on the test set


  • large context module : significant boost in accuracy over the front end
  • Context module + CRF-RNN = highest accuracy

CRF-RNN (CRF as RNN) : a conditional random field formulated as a recurrent network, used as a structured prediction step to obtain more fine-grained segmentation results in an end-to-end manner

6. Conclusion

  • Dilated convolution : suited to dense prediction, expanding the receptive field without losing resolution and increasing accuracy
  • Future architectures : end-to-end training → removing the need for pre-training → raw input in, dense labels at full resolution out