Segmentation can be viewed as pixel classification: for each pixel of an image we must predict its class, with background being one of the classes. There are two main types of segmentation:
- **Semantic segmentation** only tells the class of each pixel, and does not make a distinction between different objects of the same class.
- **Instance segmentation** divides each class into separate object instances.
For example, in an image of a flock of sheep, instance segmentation treats each sheep as a separate object, while semantic segmentation represents all sheep with a single class.
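Because segmentation is just classification at every pixel, the usual classification machinery carries over directly. Below is a minimal PyTorch sketch; the tensor shapes and the three-class setup are illustrative assumptions, not part of this lesson:

```python
import torch
import torch.nn as nn

num_classes = 3  # assumed example: background, sheep, dog

# A segmentation network outputs one score per class per pixel:
# logits have shape (batch, num_classes, height, width).
logits = torch.randn(8, num_classes, 128, 128)

# The target mask stores a class index for every pixel.
target = torch.randint(0, num_classes, (8, 128, 128))

# Cross-entropy applied per pixel -- the ordinary classification loss,
# just computed at every spatial location.
loss = nn.CrossEntropyLoss()(logits, target)

# The predicted mask is the argmax over the class dimension.
pred_mask = logits.argmax(dim=1)  # shape (8, 128, 128)
```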
There are different neural architectures for segmentation, but they all share the same structure. In a way, it is similar to the autoencoder you learned about previously, but instead of reconstructing the original image, our goal is to construct a mask. Thus, a segmentation network has the following parts:
- **Encoder** extracts features from the input image.
- **Decoder** transforms those features into a mask image of the same size as the input, with the number of channels equal to the number of classes.
The simplest encoder-decoder architecture uses convolutions and poolings in the encoder, and convolutions and upsamplings in the decoder, as in the sketch below.
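To make this concrete, here is a minimal encoder-decoder network in PyTorch; the layer sizes and the `SimpleSegNet` name are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SimpleSegNet(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, num_classes: int):
        super().__init__()
        # Encoder: convolutions + poolings shrink the spatial size.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 1/2 resolution
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 1/4 resolution
        )
        # Decoder: convolutions + upsamplings restore the original size.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(16, num_classes, 3, padding=1),  # one channel per class
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = SimpleSegNet(num_classes=3)
print(model(torch.randn(1, 3, 128, 128)).shape)  # torch.Size([1, 3, 128, 128])
```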
A slightly more advanced architecture, U-Net, adds skip connections. A skip connection at each convolution level keeps the network from losing information about the features of the original input at that level; see the sketch below.
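The sketch below shows a single skip connection at one level, reusing the toy layer sizes from the previous example; a real U-Net repeats this pattern at several levels:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, num_classes: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2)
        # The decoder sees the upsampled features *concatenated* with the
        # encoder features carried over by the skip connection.
        self.dec = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, num_classes, 1)

    def forward(self, x):
        skip = self.enc(x)             # features at full resolution
        y = self.mid(self.down(skip))  # features at 1/2 resolution
        y = self.up(y)                 # back to full resolution
        y = self.dec(torch.cat([y, skip], dim=1))  # the skip connection
        return self.head(y)

model = TinyUNet(num_classes=3)
print(model(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3, 64, 64])
```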
U-Net usually uses a standard pretrained network as its encoder for feature extraction, for example ResNet-50.
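One convenient way to build such a U-Net is the third-party segmentation-models-pytorch package; using it here is an assumption of this sketch, not a requirement of the lesson:

```python
# pip install segmentation-models-pytorch
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet50",     # pretrained feature extractor as the encoder
    encoder_weights="imagenet",  # start from ImageNet weights
    in_channels=3,               # RGB input
    classes=3,                   # assumed number of mask classes
)
```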