This project demonstrates the use of a Fully Convolutional Network (FCN) for semantic segmentation (labeling the pixels of a road). It follows the concepts published in the paper Fully Convolutional Networks for Semantic Segmentation by Jonathan Long, Evan Shelhamer, and Trevor Darrell. The model is explained well in this video.
For the encoder we use a VGG-16 model pre-trained on ImageNet classification.
- We take the output of the last FC-4096 layer (dropping the FC-1000 layer) and perform a 1x1 convolution followed by a 4x4 transpose convolution with 2x2 strides.
- To recover spatial detail we add skip connections: we take the output of maxpool layer 4, perform a 1x1 convolution, and add the result to the previous output. We then perform another 4x4 transpose convolution with 2x2 strides to upsample further.
- We repeat this process once more with the output of maxpool layer 3: a 1x1 convolution added to the previous output, followed by a final 16x16 transpose convolution with 8x8 strides (an 8x upsampling is needed here to get back to the full input resolution).
- This gives an output of the same spatial size as the input image, with one channel per class (so each channel is a mask marking the pixels belonging to that class). All these decoder layers are added in the `layers` module.
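The upsampling arithmetic behind the three decoder steps above can be sketched in plain Python. This assumes 'SAME' padding, where a transpose convolution multiplies each spatial dimension by its stride; the 160x576 input size is an illustrative assumption, not taken from the project code:

```python
def upsample(size, stride):
    """Output size of a 'SAME'-padded transpose convolution."""
    h, w = size
    return (h * stride, w * stride)

input_size = (160, 576)                              # assumed training image size
layer7 = (input_size[0] // 32, input_size[1] // 32)  # encoder output: 1/32 resolution
layer4 = (input_size[0] // 16, input_size[1] // 16)  # maxpool layer 4: 1/16 resolution
layer3 = (input_size[0] // 8,  input_size[1] // 8)   # maxpool layer 3: 1/8 resolution

x = upsample(layer7, 2)   # 4x4 transpose conv, stride 2
assert x == layer4        # now matches layer 4, so the skip connection can be added
x = upsample(x, 2)        # 4x4 transpose conv, stride 2
assert x == layer3        # now matches layer 3, so the second skip can be added
x = upsample(x, 8)        # 16x16 transpose conv, stride 8
assert x == input_size    # back to full input resolution
```

The asserts show why the skip connections line up: each 2x upsampling lands exactly on the resolution of the corresponding maxpool layer.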
We trained the model on the Kitti Road dataset (see below) for 50 epochs with a batch size of 5 and a learning rate of 0.0009; training took 30+ hours on a MacBook Pro.
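The project presumably uses TensorFlow's built-in cross-entropy op during training; this NumPy sketch just shows the quantity being minimized, a softmax cross-entropy averaged over every pixel (function name and shapes are illustrative assumptions):

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over all pixels.
    logits: (H, W, C) raw class scores; labels: (H, W, C) one-hot."""
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    log_softmax = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(labels * log_softmax).sum(axis=-1).mean()

# With uniform logits over 2 classes, the loss is ln(2) ~= 0.693
labels = np.eye(2)[np.random.randint(0, 2, size=(8, 8))]  # random one-hot label map
loss = pixelwise_cross_entropy(np.zeros((8, 8, 2)), labels)
```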
As you can see, after about 44 epochs the model was not gaining much.
The following 4 images were randomly chosen from the test dataset.
I applied the model to video taken by my dashcam. It is not perfect at identifying the road, but it does OK given the short amount of training.
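How the video frames were rendered is not shown in this write-up; a common approach is to blend a semi-transparent color over the pixels the network labels as road. A minimal NumPy sketch of that overlay step (function name, colors, and shapes are assumptions):

```python
import numpy as np

def overlay_road(frame, road_mask, color=(0, 255, 0), alpha=0.5):
    """Blend a semi-transparent color over the predicted road pixels.
    frame: (H, W, 3) uint8 image; road_mask: (H, W) boolean."""
    out = frame.astype(np.float64)
    out[road_mask] = (1.0 - alpha) * out[road_mask] + alpha * np.asarray(color, dtype=np.float64)
    return out.astype(np.uint8)

frame = np.zeros((4, 4, 3), dtype=np.uint8)  # dummy black frame
mask = np.zeros((4, 4), dtype=bool)
mask[2:, :] = True                           # pretend the bottom half is road
result = overlay_road(frame, mask)           # bottom half tinted green
```

The same function can be mapped over each frame of a dashcam clip to produce the annotated video.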
Download the Kitti Road dataset from here. Extract the dataset into the `data` folder. This will create the folder `data_road` with all the training and test images.
Check out other Semantic Segmentation techniques outlined in this blog.