[CV_Segmentation] Multi-scale context aggregation by dilated convolutions
1. INTRODUCTION
- Semantic segmentation requires combining pixel-level accuracy with multi-scale contextual reasoning
- Structural differences between image classification and dense prediction
dense prediction : predicting a label for each pixel of the image
- Repurposed classification networks : which components are truly necessary, and which reduce accuracy when operated densely?
- Modern classification networks
- Integrating multi-scale contextual information via successive pooling and subsampling → reduce resolution
- BUT dense prediction needs full-resolution output
- Prior ways of reconciling multi-scale reasoning with full-resolution output
- repeated up-convolutions : recover resolution lost to severe intermediate downsampling → is that downsampling necessary?
- combining predictions over multiple rescaled inputs : each scale is analyzed separately → is separate analysis necessary?
- Dilated convolutions : conv module designed for dense prediction (semantic segmentation)
- multi-scale contextual information without losing resolution
- plugged into existing architectures at any resolution
- no pooling or subsampling
- exponential expansion of receptive field without losing resolution or coverage
- improves accuracy of state-of-the-art semantic segmentation systems
2. Dilated convolutions
- Dilated convolution ∗_l applies the same filter at different ranges by varying the dilation factor l
- F_(i+1) = F_i ∗_(2^i) k_i for i = 0, 1, ..., n-2
- F : discrete functions, k : discrete 3×3 filters
- Receptive field of each element in F_(i+1) : (2^(i+2) − 1) × (2^(i+2) − 1), a square of exponentially increasing size
- (a) F_1 : 3×3, (b) F_2 : 7×7, (c) F_3 : 15×15 receptive field
- In the paper's figure, non-red entries have zero filter weight
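As a check on the receptive-field arithmetic above, here is a minimal NumPy sketch (my own illustration, not the paper's code) that pushes an impulse through 3×3 all-ones filters with dilations 2^i for i = 0, 1, 2 and confirms the 15×15 receptive field of F_3:

```python
import numpy as np

def dilated_conv2d(x, k, d):
    """'Same'-padded 2-D dilated convolution (cross-correlation) with a 3x3 kernel."""
    H, W = x.shape
    xp = np.pad(x, d)  # pad by d keeps the resolution for a 3x3 kernel
    out = np.zeros_like(x)
    for u in range(3):
        for v in range(3):
            out += k[u, v] * xp[u * d:u * d + H, v * d:v * d + W]
    return out

# Impulse input; all-ones 3x3 filters with dilations 2^i (i = 0, 1, 2).
x = np.zeros((31, 31))
x[15, 15] = 1.0
y = x
for d in (1, 2, 4):
    y = dilated_conv2d(y, np.ones((3, 3)), d)

rows = np.where(y.any(axis=1))[0]
cols = np.where(y.any(axis=0))[0]
# F_3's receptive field is (2^4 - 1) x (2^4 - 1) = 15 x 15
assert (rows[-1] - rows[0] + 1, cols[-1] - cols[0] + 1) == (15, 15)
```

Note how resolution never drops: every intermediate map stays 31×31 while the support of the composed filter grows exponentially.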
3. Multi-scale context aggregation
[ Context module ]
- Input, Output
- C feature maps → C feature maps : can maintain resolution
- Same form : can be plugged into any dense prediction architecture
- Each layer has C channels
- when C equals the number of classes, the output can directly serve as a dense per-class prediction
- feature maps inside the module are not normalized, and no loss is defined within it
- multiple layers that expose contextual information → increased accuracy
[ Basic Context module ]
- 7 layers : 3x3xC conv with different dilation factors (1,1,2,4,8,16,1)
- A final layer : 1x1xC conv → produce output of the module
- Front-end output feature maps are 64×64, so the dilation expansion stops after layer 6
- Identity initialization : set all filters such that each layer simply passes its input through to the next
- Result : increases dense prediction accuracy both quantitatively and qualitatively, with a small number of parameters (64C^2 total)
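The bookkeeping for the basic context module can be sketched as follows; the layer list and the identity-kernel helper are my own rendering of the paper's recipe, not its released code:

```python
import numpy as np

# Assumed layer spec: (kernel size, dilation) for the seven 3x3 layers + final 1x1.
LAYERS = [(3, 1), (3, 1), (3, 2), (3, 4), (3, 8), (3, 16), (3, 1), (1, 1)]

rf, params = 1, 0
for k, d in LAYERS:
    rf += d * (k - 1)   # each layer grows the receptive field by d*(k-1)
    params += k * k     # C*C*k*k weights per layer, counted in units of C^2
assert rf == 67         # receptive field of the full module: 67x67
assert params == 64     # 64*C^2 parameters in total

def identity_kernel(C, k=3):
    """Identity initialization: each filter passes channel c straight to channel c."""
    w = np.zeros((C, C, k, k))
    for c in range(C):
        w[c, c, k // 2, k // 2] = 1.0
    return w

def dilated_conv(x, w, d):
    """'Same'-padded multi-channel 3x3 dilated convolution; x: (C, H, W)."""
    Cin, H, W = x.shape
    xp = np.pad(x, ((0, 0), (d, d), (d, d)))
    out = np.zeros((w.shape[0], H, W))
    for co in range(w.shape[0]):
        for ci in range(Cin):
            for u in range(3):
                for v in range(3):
                    out[co] += w[co, ci, u, v] * xp[ci, u*d:u*d+H, v*d:v*d+W]
    return out

# Identity-initialized layers pass feature maps through unchanged at any dilation.
x = np.random.rand(2, 8, 8)
assert np.allclose(dilated_conv(x, identity_kernel(2), d=4), x)
```

The identity check makes the initialization strategy concrete: at the start of training the module is a no-op, and backpropagation is free to learn only the contextual corrections that help.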
4. Front End
[ Front End module ] : backbone that provides the input feature maps for the context module
- Input : reflection-padded color image → Output : 64×64×C feature maps
- remove the last 2 pooling and striding layers of VGG-16; convolutions in subsequent layers are dilated by a factor of 2 for each removed pooling layer (i.e., by 2, then 4)
- padding of intermediate feature maps is also removed
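The resolution arithmetic behind removing the last two poolings can be sanity-checked with a toy stride calculator; the 512×512 input size is my assumption for round numbers, not the paper's exact padded size:

```python
# VGG-16 has 5 stride-2 pooling layers; the front end removes the last 2.
def output_resolution(input_size, num_pools):
    # Dilated convolutions keep resolution; only the poolings downsample.
    return input_size // (2 ** num_pools)

assert output_resolution(512, 5) == 16   # original VGG-16: 1/32 resolution
assert output_resolution(512, 3) == 64   # front end: 1/8 resolution -> 64x64 maps
```

Dilating the later convolutions (by 2, then 4) compensates for the removed subsampling, so the filters still see large context while the maps stay at 1/8 resolution.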
- Training
- Pascal VOC 2012 training set + subset of annotations of validation set
- SGD, batch size = 14, lr = 10^-3, momentum = 0.9, iterations = 60K
- Test result : the front end alone is both simpler and more accurate than prior adaptations of classification networks for segmentation
5. Experiments
- Implementation : based on Caffe library
- Dataset : Microsoft COCO with VOC-2012 categories
- Training : 2 stage
- 1st : VOC-2012 & COCO : SGD, batch size = 14, momentum = 0.9, iterations = 100K (lr = 10^-3) + 40K (lr = 10^-4)
- 2nd : fine-tuned network on VOC-2012 only : iterations = 50K (lr = 10^-5)
- Test result
- Front-end module (alone) : 69.8% mean IoU on val set, 71.3% on test set
- Attribution : the high accuracy comes from removing vestigial components that served image classification but hinder dense prediction
(1) Controlled evaluation of context aggregation
- context module and structured prediction are synergistic → accuracy increases in each configuration
- the large context module increases accuracy by a larger margin
(2) Evaluation on the test set
- large context module : significant boost in accuracy over the front end
- Context module + CRF-RNN = highest accuracy
CRF-RNN (Conditional Random Field as RNN) : a post-processing step that refines segmentation boundaries and is trainable end to end
6. Conclusion
- Dilated convolutions : dense prediction with an expanding receptive field, without losing resolution, and with higher accuracy
- Future architectures : fully end-to-end, removing the need for pre-training → raw input in, dense full-resolution labels out