A subpixel, densely connected, and deeply supervised U-Net-like architecture for segmenting retinal vessels.
- Contributors
- Introduction
- Dataset Used
- Augmentation & Preprocessing
- Network Architecture
- Loss Function & Optimizer
- Training setup
- Evaluation Metric
- Results
This project was created through the joint efforts of
Automatic extraction of retinal vessels from retinal fundus images is a long-researched area in AI-aided medical diagnosis, as it assists in the implementation of screening programs for diabetic retinopathy, in establishing the relationship between vessel tortuosity and hypertensive retinopathy, in measuring vessel diameter in relation to the diagnosis of hypertension, and in computer-assisted laser surgery. The retinal vascular tree has also been found to be unique to each individual and can be used for biometric identification. For these and many other reasons, automated retinal vessel extraction is an important task. In this project, we use a modified U-Net-like architecture and achieve near state-of-the-art results.
The DRIVE dataset is used here for both training and testing. It consists of 40 expert-annotated retinal fundus images divided equally into a training set and a test set of 20 images each. Each image was captured using 8 bits per color plane at 768 by 584 pixels. The field of view (FOV) of each image is circular with a diameter of approximately 540 pixels, and the images have been cropped around the FOV. For each image, a mask image delineating the FOV is provided. A sample fundus image along with its hand-annotated mask is shown below.
Original Image | Mask Image | AV Mask |
---|---|---|
The training data was augmented on the fly using the Albumentations library. A strong combination of image augmentations was applied with varied probabilities:
- Random flips.
- Transpose.
- Shift, scale & rotate.
- Random rotations.
- Optical distortion.
- Grid distortion.
- Elastic transform.
- RGB shift.
- Random gamma.
- Random brightness & contrast.
In addition to the above augmentations, every image in the training and test sets underwent a histogram-equalization preprocessing step, namely CLAHE (Contrast Limited Adaptive Histogram Equalization).
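A minimal sketch of such an on-the-fly pipeline, assuming the Albumentations API; the probabilities and parameter limits below are illustrative, not the exact values used in this project:

```python
import albumentations as A

# Illustrative augmentation + CLAHE pipeline (probabilities/limits are assumptions).
# `image` and `mask` are assumed to be uint8 NumPy arrays of the same spatial size.
train_transform = A.Compose([
    A.CLAHE(p=1.0),                     # histogram-equalization preprocessing
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.Transpose(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.1, rotate_limit=45, p=0.5),
    A.RandomRotate90(p=0.5),
    A.OpticalDistortion(p=0.3),
    A.GridDistortion(p=0.3),
    A.ElasticTransform(p=0.3),
    A.RGBShift(p=0.3),
    A.RandomGamma(p=0.3),
    A.RandomBrightnessContrast(p=0.3),
])

# Spatial transforms are applied identically to the image and its mask.
augmented = train_transform(image=image, mask=mask)
image_aug, mask_aug = augmented["image"], augmented["mask"]
```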
Some examples of augmented images and masks are given below.
Augmented Training Images |
---|
Augmented Training Masks |
---|
- The proposed architecture, shown in Fig. 1 below, is inspired by the original U-Net architecture with major modifications. It is a 4-stage segmentation architecture based on the concept of stacking. The feature map from the penultimate layer of the first stage is passed on to the second stage, since the penultimate layer contains more information than the final output layer. The passed-on feature map is concatenated with the original input image (shown in bold black arrows) and fed as the input to the second stage. Concatenating the original image with the previous feature map refines the predictions. Similarly, the feature map from the penultimate layer of the second stage is passed on to the third stage and concatenated with the original input image, and so on until the fourth stage. The final predictions are obtained from the output of the fourth stage. A minimal code sketch of this stacking scheme is given after this list.
- The learnable transposed-convolution or interpolation-based up-sampling method is replaced by efficient subpixel convolutions. A subpixel convolutional layer, shown in Fig. 2, is a standard 1x1 convolution layer followed by a pixel-shuffling operation that rearranges pixels from the depth dimension to the spatial dimensions. This type of convolution minimizes information loss while up-sampling in the decoder network; a minimal sketch of such a block is given after the figure captions below.
- Each stage has its own output layer and the loss is minimized at each stage, i.e., the architecture is trained with deep supervision.
- The parallel layers of the four decoder networks are connected via dense connections, which help improve the transfer of spatial knowledge across the stages of the model during training.
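As referenced above, a minimal PyTorch-style sketch of the stacking scheme. The per-stage module and its interface (returning both the penultimate feature map and a prediction) are assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn

class StackedVesselNet(nn.Module):
    """Hypothetical 4-stage stacked segmentation network (illustrative only)."""

    def __init__(self, stages):
        super().__init__()
        # Each stage is assumed to be a small U-Net-like module that returns
        # (penultimate_feature_map, prediction) for its input tensor.
        self.stages = nn.ModuleList(stages)

    def forward(self, image):
        predictions = []
        features = None
        for stage in self.stages:
            if features is None:
                x = image                                # stage 1 sees only the raw image
            else:
                x = torch.cat([image, features], dim=1)  # later stages: image + previous penultimate features
            features, pred = stage(x)
            predictions.append(pred)
        # One prediction per stage; all four are supervised (deep supervision),
        # and the stage-4 prediction is taken as the final output.
        return predictions
```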
Fig. 1 Network architecture | Fig. 2 Subpixel Convolutional layer |
---|---|
Fig. 3 Encoder Block |
---|
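A minimal PyTorch-style sketch of the subpixel up-sampling block described above (Fig. 2); the block name and channel arguments are illustrative:

```python
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    """1x1 convolution followed by PixelShuffle: channels are expanded by scale^2,
    then rearranged from the depth dimension into the spatial dimensions."""

    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels * scale ** 2, kernel_size=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

# Example: a (N, 64, 32, 32) feature map becomes (N, 32, 64, 64).
# up = SubPixelUpsample(64, 32, scale=2)
```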
Pixel-wise binary cross-entropy (BCE) is a widely used loss function for semantic segmentation since it evaluates the class prediction for each pixel individually, but it suffers from class imbalance. We therefore use a second loss function alongside the BCE loss, the Dice loss, which has a normalizing effect and is not affected by class imbalance.
Dice Coefficient = 2 × |A ∩ B| / (|A| + |B|)
Dice Loss = 1 − Dice Coefficient
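A minimal PyTorch-style sketch of this combined loss (the smoothing term and the equal BCE/Dice weighting are assumptions):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, smooth=1.0):
    """Soft Dice loss: 1 - 2|A ∩ B| / (|A| + |B|), computed on per-pixel probabilities."""
    probs = torch.sigmoid(logits).flatten(1)   # (batch, H*W) predicted vessel probabilities
    targets = targets.flatten(1)               # (batch, H*W) binary ground-truth masks (float)
    intersection = (probs * targets).sum(dim=1)
    dice = (2.0 * intersection + smooth) / (probs.sum(dim=1) + targets.sum(dim=1) + smooth)
    return 1.0 - dice.mean()

def bce_dice_loss(logits, targets):
    """Pixel-wise BCE combined with the Dice loss (equal weighting is an assumption)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    return bce + dice_loss(logits, targets)
```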
The Adam optimizer was used with its default parameters.
- GPU: Nvidia P100 16GB
- CPU: Intel Xeon
- RAM: 12GB DDR4
The network was trained using the above-mentioned setup for 40 epochs with a batch size of 10 and an input image size of 256 × 256 × 3. The total training time was 1.5 hours.
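A sketch of the corresponding training loop under these settings; `model`, `train_loader` and `bce_dice_loss` refer to the hypothetical pieces sketched earlier, and the per-stage loss summation reflects the deep supervision described above:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
optimizer = torch.optim.Adam(model.parameters())    # Adam with default parameters

for epoch in range(40):                             # 40 epochs
    model.train()
    for images, masks in train_loader:              # batch size 10, 3 x 256 x 256 inputs
        images, masks = images.to(device), masks.to(device)
        stage_outputs = model(images)               # one prediction per stage
        loss = sum(bce_dice_loss(out, masks) for out in stage_outputs)  # deep supervision
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```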
The F1-score was used for evaluation as it takes class imbalance into account: the number of non-vessel pixels greatly outnumbers the vessel pixels. It is the harmonic mean of precision and recall.
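F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

A minimal sketch of the per-image evaluation, assuming scikit-learn and a 0.5 binarization threshold (the threshold value is an assumption):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, jaccard_score

def evaluate(pred_probs, gt_mask, threshold=0.5):
    """Pixel-wise F1, AUC and IoU for one image: `pred_probs` holds predicted
    vessel probabilities, `gt_mask` the binary ground-truth mask."""
    y_true = gt_mask.reshape(-1).astype(np.uint8)
    y_score = pred_probs.reshape(-1)
    y_pred = (y_score >= threshold).astype(np.uint8)
    return {
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
        "iou": jaccard_score(y_true, y_pred),
    }
```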
The table below compares the F1-score, AUC and the IoU of all the stages.
Stage | F1-Score | AUC | IoU | Dice coef. |
---|---|---|---|---|
Stage 1 | 84.0 | 90.0 | 74.0 | -- |
Stage 2 | 85.5 | 91.7 | 74.2 | -- |
Stage 3 | 86.3 | 92.0 | 75.6 | -- |
Stage 4 | 87.5 | 94.7 | 77.2 | -- |
Average | 87.8 | 92.1 | 75.2 | -- |
The table below compares our method with existing methods.
Model | F1-Score | AUC | IoU | Dice coef. |
---|---|---|---|---|
U-Net | ||||
U-Net++ | ||||
Wide U-Net | ||||
Residual U-Net | ||||
Pretrained VGG encoded U-Net | ||||
Pretrained MobileNetV2 encoded U-Net | ||||
Pretrained ResNet encoded U-Net | ||||
Proposed Method | 87.5 | 94.7 | 77.2 | -- |
The predicted masks from each stage are shown below along with their respective ground truth masks and the original fundus images. Notice that the predictions from stage 4 are the best.
Original Fundus Images |
---|
Original Ground Truth Masks |
---|
Stage 1 Predictions |
---|
Stage 2 Predictions |
---|
Stage 3 Predictions |
---|
Stage 4 Predictions |
---|