In this project, I implemented a Fully Convolutional Network (FCN) on top of the last layer of vgg16 to label the pixels of a road in images.
AWS g3.x4large
Download the Kitti Road dataset from here. Extract the dataset in the data
folder. This will create the folder data_road
with all the training a test images.
I picked a pre-trained VGG-16 network and added a de-convolution layer. The last fully connected layer to is converted to a 1x1 convolution and the depth is set to the number of classes (road and not-road). Performance is improved via two approach:
- Upsample convolved layer 7 (the last layer).
- Scale the output of layer 3 and layer 4 before performing 1x1 convolution on them.
- Skip connections. Scaled and 1x1-convolved layer 4 is added element-wisely to upsampled 1x1-convolved layer 7, and then the sum is further upsampled and added element-wisely to scaled and 1x1-convolved layer 3. Each convolution and transpose convolution layer includes a kernel initializer and L2 regularizer.
The loss function is cross entropy, and the optimizer is Adam optimizer.
The scaling factors for the outputs from layer 3 and layer 4 are 0.0001 and 0.01. Without the scaling factors the training will converge much slower and the result is worse. keep_prob: 0.5 learning_rate: 0.0001 epochs: 50 batch_size: 5
Loss per batch is below 0.200 after one epoch, and below 0.80 after ten epochs. Average loss per batch at epoch 20 is below 0.055, at epoch 30 is below 0.030, at epoch 40 is below 0.020, and at epoch 50 around 0.016.
Below are a few sample images from the output of the fully convolutional network, with the segmentation class overlaid upon the original image in green.