DeepLab with PyTorch

This is an unofficial PyTorch implementation of DeepLab v2 [1] with a ResNet-101 backbone. COCO-Stuff dataset [2] and PASCAL VOC dataset [3] are supported. The initial weights (.caffemodel) officially provided by the authors are can be converted/used without building the Caffe API. DeepLab v3/v3+ models with the identical backbone are also included (although not tested). torch.hub is supported.

Performance

Pretrained models are provided for each training set. Note that the 2D interpolation ways are different from the original, which leads to a bit better results.

COCO-Stuff

Train set	Eval set	CRF?	Code	Pixel Accuracy	Mean Accuracy	Mean IoU	FreqW IoU
10k train † (Model)	10k val †		Original [2]	65.1	45.5	34.4	50.4
			Ours	65.8	45.7	34.8	51.2
		✓	Ours	67.1	46.4	35.6	52.5
164k train (Model)	10k val		Ours	68.4	55.6	44.2	55.1
	10k val	✓	Ours	69.2	55.9	45.0	55.9
	164k val		Ours	66.8	51.2	39.1	51.5
	164k val	✓	Ours	67.6	51.5	39.7	52.3

† Images and labels are pre-warped to square-shape 513x513

PASCAL VOC 2012

Train set	Eval set	CRF?	Code	Pixel Accuracy	Mean Accuracy	Mean IoU	FreqW IoU
trainaug (Model)	val		Original [3]	-	-	76.35	-
			Ours	94.64	86.50	76.65	90.41
		✓	Original [3]	-	-	77.69	-
		✓	Ours	95.04	86.64	77.93	91.06

Setup

Requirements

Python 2.7+/3.6+
Anaconda environement

Then setup from conda_env.yaml. Please modify cuda option as needed (default: cudatoolkit=10.0)

$ conda env create -f configs/conda_env.yaml
$ conda activate deeplab-pytorch

Datasets

Setup instruction is provided in each link.

Initial weights

Run the script below to download caffemodel pre-trained on ImageNet and 91-class COCO (1GB+).

$ bash scripts/setup_caffemodels.sh

Convert the caffemodel to pytorch compatible. No need to build the official Caffe API!

# This generates "deeplabv1_resnet101-coco.pth" from "init.caffemodel"
$ python convert.py --dataset coco

Training

Please see ./scripts/train_eval.sh for example usage.

Usage: main.py train [OPTIONS]

  Training DeepLab by v2 protocol

Options:
  -c, --config-path FILENAME  Dataset configuration file in YAML  [required]
  --cuda / --cpu              Enable CUDA if available [default: --cuda]
  --help                      Show this message and exit.

To monitor a loss, lr values, and gpu usage:

$ tensorboard --logdir data/logs

Common settings:

Model: DeepLab v2 with ResNet-101 backbone. Dilated rates of ASPP are (6, 12, 18, 24). Output stride is 8.
Multi-GPU: All the GPUs visible to the process are used. Please specify the scope with CUDA_VISIBLE_DEVICES=.
Multi-scale loss: Loss is defined as a sum of responses from multi-scale inputs (1x, 0.75x, 0.5x) and element-wise max across the scales. The unlabeled class is ignored in the loss computation.
Gradient accumulation: The mini-batch of 10 samples is not processed at once due to the high occupancy of GPU memories. Instead, gradients of small batches of 5 samples are accumulated for 2 iterations, and weight updating is performed at the end (batch_size * iter_size = 10). GPU memory usage is approx. 11.2 GB with the default setting (tested on the single Titan X). You can reduce it with a small batch_size.
Learning rate: Stochastic gradient descent (SGD) is used with momentum of 0.9 and initial learning rate of 2.5e-4. Polynomial learning rate decay is employed; the learning rate is multiplied by (1-iter/iter_max)**power at every 10 iterations.
Monitoring: Moving average loss (average_loss in Caffe) can be monitored in TensorBoard.

COCO-Stuff 164k:

#Iterations: Updated 100k iterations.
#Classes: The label indices range from 0 to 181 and the model outputs a 182-dim categorical distribution, but only 171 classes are supervised with COCO-Stuff. Index 255 is an unlabeled class to be ignored.
Preprocessing: (1) Input images are randomly re-scaled by factors ranging from 0.5 to 1.5, (2) padded if needed, and (3) randomly cropped to 321x321.

COCO-Stuff 10k:

#Iterations: Updated 20k iterations.
#Classes: Same as the 164k version above.
Preprocessing: (1) Input images are initially warped to 513x513 squares, (2) randomly re-scaled by factors ranging from 0.5 to 1.5, (3) padded if needed, and (4) randomly cropped to 321x321 so that the input size is fixed during training.

PASCAL VOC 2012:

#Iterations: Updated 20k iterations.
#Classes: 20 foreground objects + 1 background. Index 255 is an unlabeled class to be ignored.
Preprocessing: (1) Input images are randomly re-scaled by factors ranging from 0.5 to 1.5, (2) padded if needed, and (3) randomly cropped to 321x321.

Processed image vs. label examples in COCO-Stuff:

Evaluation

To compute scores in:

Pixel accuracy
Mean accuracy
Mean IoU
Frequency weighted IoU

Usage: main.py test [OPTIONS]

  Evaluation on validation set

Options:
  -c, --config-path FILENAME  Dataset configuration file in YAML  [required]
  -m, --model-path PATH       PyTorch model to be loaded  [required]
  --cuda / --cpu              Enable CUDA if available [default: --cuda]
  --help                      Show this message and exit.

To perform CRF post-processing:

Usage: main.py crf [OPTIONS]

  CRF post-processing on pre-computed logits

Options:
  -c, --config-path FILENAME  Dataset configuration file in YAML  [required]
  -j, --n-jobs INTEGER        Number of parallel jobs  [default: # cores]
  --help                      Show this message and exit.

Demo

COCO-Stuff 164k	COCO-Stuff 10k	PASCAL VOC 2012	Pretrained COCO

Single image

Usage: demo.py single [OPTIONS]

  Inference from a single image

Options:
  -c, --config-path FILENAME  Dataset configuration file in YAML  [required]
  -m, --model-path PATH       PyTorch model to be loaded  [required]
  -i, --image-path PATH       Image to be processed  [required]
  --cuda / --cpu              Enable CUDA if available [default: --cuda]
  --crf                       CRF post-processing  [default: False]
  --help                      Show this message and exit.

Webcam

A class of mouseovered pixel is shown in terminal.

Usage: demo.py live [OPTIONS]

  Inference from camera stream

Options:
  -c, --config-path FILENAME  Dataset configuration file in YAML  [required]
  -m, --model-path PATH       PyTorch model to be loaded  [required]
  --cuda / --cpu              Enable CUDA if available [default: --cuda]
  --crf                       CRF post-processing  [default: False]
  --camera-id INTEGER         Device ID  [default: 0]
  --help                      Show this message and exit.

torch.hub

Model setup with 3 lines.

import torch.hub
model = torch.hub.load("kazuto1011/deeplab-pytorch", "deeplabv2_resnet101", n_classes=182)
model.load_state_dict(torch.load("deeplabv2_resnet101_msc-cocostuff164k-100000.pth"))

Misc

Difference with Caffe version

While the official code employs 1/16 bilinear interpolation (Interp layer) for downsampling a label for only 0.5x input, this codebase does for both 0.5x and 0.75x inputs with nearest interpolation (PIL.Image.resize, related issue).
Bilinear interpolation on images and logits is performed with the align_corners=False.

Training batch normalization

This codebase only supports DeepLab v2 training which freezes batch normalization layers, although v3/v3+ protocols require training them. If training their parameters on multiple GPUs as well in your projects, please install the extra library below.

pip install torch-encoding

Batch normalization layers in a model are automatically switched in libs/models/resnet.py.

try:
    from encoding.nn import SyncBatchNorm
    _BATCH_NORM = SyncBatchNorm
except:
    _BATCH_NORM = nn.BatchNorm2d

References

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE TPAMI, 2018.
Project / Code / arXiv paper
H. Caesar, J. Uijlings, V. Ferrari. COCO-Stuff: Thing and Stuff Classes in Context. In CVPR, 2018.
Project / arXiv paper
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.
Project / Paper

AkoZ/deeplab-pytorch