[CV_Segmentation] Multi-scale context aggregation by dilated convolutions
1. INTRODUCTION
- Semantic segmentation requires combining pixel-level accuracy with multi-scale contextual reasoning
- Structural differences between image classification and dense prediction
dense prediction : predicting a label for each pixel of the image
- Repurposed classification networks : which components are truly necessary, and which reduce accuracy when operated densely?
- Modern classification networks
- Integrating multi-scale contextual information via successive pooling and subsampling → reduce resolution
- BUT dense prediction needs full-resolution output
- Prior ways of reconciling multi-scale reasoning with full-resolution output
- repeated up-convolutions : recover resolution lost to severe intermediate downsampling → is that downsampling necessary?
- combining predictions over multiple rescaled inputs : each scale is analyzed separately → is separate analysis necessary?
- Dilated convolutions : conv module designed for dense prediction (semantic segmentation)
- multi-scale contextual information without losing resolution
- plugged into existing architectures at any resolution
- no pooling or subsampling
- exponential expansion of receptive field without losing resolution or coverage
- improves accuracy of state-of-the-art semantic segmentation systems
2. Dilated convolutions
- Dilated convolution ∗_l applies the same filter at different ranges by varying the dilation factor l
- F_(i+1) = F_i ∗_(2^i) k_i for i = 0, 1, ..., n-2
- F : discrete functions, k : discrete 3×3 filters
- Receptive field of each element in F_(i+1) : (2^(i+2) − 1) × (2^(i+2) − 1), a square of exponentially increasing size
- (a) F_1 : 3×3, (b) F_2 : 7×7, (c) F_3 : 15×15 receptive field
- In the paper's figure, non-red entries have zero filter weight
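As a check on the receptive-field arithmetic above, here is a minimal NumPy sketch (my own illustration, not the paper's code) that pushes an impulse through 3×3 all-ones filters with dilations 2^i for i = 0, 1, 2 and confirms the 15×15 receptive field of F_3:

```python
import numpy as np

def dilated_conv2d(x, k, d):
    """'Same'-padded 2-D dilated convolution (cross-correlation) with a 3x3 kernel."""
    H, W = x.shape
    xp = np.pad(x, d)  # pad by d keeps the resolution for a 3x3 kernel
    out = np.zeros_like(x)
    for u in range(3):
        for v in range(3):
            out += k[u, v] * xp[u * d:u * d + H, v * d:v * d + W]
    return out

# Impulse input; all-ones 3x3 filters with dilations 2^i (i = 0, 1, 2).
x = np.zeros((31, 31))
x[15, 15] = 1.0
y = x
for d in (1, 2, 4):
    y = dilated_conv2d(y, np.ones((3, 3)), d)

rows = np.where(y.any(axis=1))[0]
cols = np.where(y.any(axis=0))[0]
# F_3's receptive field is (2^4 - 1) x (2^4 - 1) = 15 x 15
assert (rows[-1] - rows[0] + 1, cols[-1] - cols[0] + 1) == (15, 15)
```

Note how resolution never drops: every intermediate map stays 31×31 while the support of the composed filter grows exponentially.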
3. Multi-scale context aggregation
[ Context module ]
- Input, Output
- C feature maps → C feature maps : can maintain resolution
- Same form : can be plugged into any dense prediction architecture
- Each layer has C channels
- when C equals the number of classes, the output can directly serve as a dense per-class prediction
- feature maps inside the module are not normalized, and no loss is defined within it
- multiple layers that expose contextual information → increased accuracy
[ Basic Context module ]
- 7 layers : 3x3xC conv with different dilation factors (1,1,2,4,8,16,1)
- A final layer : 1x1xC conv → produce output of the module
- Front-end output feature maps are 64×64, so the dilation expansion stops after layer 6
- Identity initialization : set all filters such that each layer simply passes its input through to the next
- Result : increases dense prediction accuracy both quantitatively and qualitatively, with a small number of parameters (64C^2 total)
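The bookkeeping for the basic context module can be sketched as follows; the layer list and the identity-kernel helper are my own rendering of the paper's recipe, not its released code:

```python
import numpy as np

# Assumed layer spec: (kernel size, dilation) for the seven 3x3 layers + final 1x1.
LAYERS = [(3, 1), (3, 1), (3, 2), (3, 4), (3, 8), (3, 16), (3, 1), (1, 1)]

rf, params = 1, 0
for k, d in LAYERS:
    rf += d * (k - 1)   # each layer grows the receptive field by d*(k-1)
    params += k * k     # C*C*k*k weights per layer, counted in units of C^2
assert rf == 67         # receptive field of the full module: 67x67
assert params == 64     # 64*C^2 parameters in total

def identity_kernel(C, k=3):
    """Identity initialization: each filter passes channel c straight to channel c."""
    w = np.zeros((C, C, k, k))
    for c in range(C):
        w[c, c, k // 2, k // 2] = 1.0
    return w

def dilated_conv(x, w, d):
    """'Same'-padded multi-channel 3x3 dilated convolution; x: (C, H, W)."""
    Cin, H, W = x.shape
    xp = np.pad(x, ((0, 0), (d, d), (d, d)))
    out = np.zeros((w.shape[0], H, W))
    for co in range(w.shape[0]):
        for ci in range(Cin):
            for u in range(3):
                for v in range(3):
                    out[co] += w[co, ci, u, v] * xp[ci, u*d:u*d+H, v*d:v*d+W]
    return out

# Identity-initialized layers pass feature maps through unchanged at any dilation.
x = np.random.rand(2, 8, 8)
assert np.allclose(dilated_conv(x, identity_kernel(2), d=4), x)
```

The identity check makes the initialization strategy concrete: at the start of training the module is a no-op, and backpropagation is free to learn only the contextual corrections that help.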
4. Front End
[ Front End module ] : backbone that provides the input feature maps for the context module
- Input : reflection-padded color image → Output : 64×64×C feature maps
- remove the last 2 pooling and striding layers of VGG-16; convolutions in subsequent layers are dilated by a factor of 2 for each removed pooling layer (i.e., by 2, then 4)
- padding of intermediate feature maps is also removed
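The resolution arithmetic behind removing the last two poolings can be sanity-checked with a toy stride calculator; the 512×512 input size is my assumption for round numbers, not the paper's exact padded size:

```python
# VGG-16 has 5 stride-2 pooling layers; the front end removes the last 2.
def output_resolution(input_size, num_pools):
    # Dilated convolutions keep resolution; only the poolings downsample.
    return input_size // (2 ** num_pools)

assert output_resolution(512, 5) == 16   # original VGG-16: 1/32 resolution
assert output_resolution(512, 3) == 64   # front end: 1/8 resolution -> 64x64 maps
```

Dilating the later convolutions (by 2, then 4) compensates for the removed subsampling, so the filters still see large context while the maps stay at 1/8 resolution.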
- Training
- Pascal VOC 2012 training set + subset of annotations of validation set
- SGD, batch size = 14, lr = 10^-3, momentum = 0.9, iterations = 60K
- Test result : the front end alone is both simpler and more accurate than prior adaptations of classification networks for segmentation
5. Experiments
- Implementation : based on Caffe library
- Dataset : Microsoft COCO with VOC-2012 categories
- Training : 2 stage
- 1st : VOC-2012 & COCO : SGD, batch size = 14, momentum = 0.9, iterations = 100K (lr = 10^-3) + 40K (lr = 10^-4)
- 2nd : fine-tuned network on VOC-2012 only : iterations = 50K (lr = 10^-5)
- Test result
- Front-end module (alone) : 69.8% mean IoU on val set, 71.3% on test set
- Attribution : the high accuracy comes from removing vestigial components that served image classification but hinder dense prediction
(1) Controlled evaluation of context aggregation
- context module and structured prediction are synergistic → accuracy increases in each configuration
- the large context module increases accuracy by a larger margin
(2) Evaluation on the test set
- large context module : significant boost in accuracy over the front end
- Context module + CRF-RNN = highest accuracy
CRF-RNN (Conditional Random Field as RNN) : a post-processing step that refines segmentation boundaries and is trainable end to end
6. Conclusion
- Dilated convolutions : dense prediction with an expanding receptive field, without losing resolution, and with higher accuracy
- Future architectures : fully end-to-end, removing the need for pre-training → raw input in, dense full-resolution labels out