CNN_Image_Annotaion

Implementation of some basic Image Annotation methods (using various loss functions & threshold optimization) on Corel-5k dataset with PyTorch library

Dataset

The 'Corel-5k' folder contains the Corel-5k dataset: 5000 real images (plus 13,500 generated images) from 50 categories. Each category includes 100 images, and the vocabulary contains 260 labels in total. Corel-5k is one of the benchmark datasets in the image annotation field; its small size and uneven label distribution make it challenging. Larger datasets, such as IAPR TC-12 with 19,627 photos and ESP-GAME with 20,770 photos, are also commonly used in this field.
Usually, Corel-5k is divided into 3 parts: a training set of 4000 images, a validation set of 500 images, and a test set of 499 images. Here the training and validation sets are used together, so 4500 images are available for training (18,000 with the generated images) and 499 for evaluation. (After downloading Corel-5k, replace its 'images' folder with the corresponding 'images' folder in the 'Corel-5k' folder.)

You can see the distribution of some labels below: (total 5000 images)

| class name | count |
| --- | --- |
| sails | 2 |
| orchid | 2 |
| butterfly | 4 |
| cave | 6 |
| ... | ... |
| cars | 151 |
| flowers | 296 |
| grass | 497 |
| tree | 947 |
| sky | 988 |
| water | 1120 |

Data Augmentation

As mentioned previously, Corel-5k has only 4500 images of training data, which makes complex models prone to overfitting. Data augmentation methods can help with this issue. According to the paper of Xiao Ke, et al., generative adversarial networks are among the best models for data augmentation.
They proposed a multi-label data augmentation method based on Wasserstein-GAN. (The process of ML-WGAN is shown in the picture below.) Because of the nature of multi-label images, any two images in a dataset usually have different numbers and types of labels, so WGAN cannot be used directly for multi-label data augmentation. The paper suggests using only one multi-label image at a time, since the generator, fed with the noise (z), can only approximate the distribution of that single image iteratively. Because the generated images use one original image as the real data distribution, they all have the same number and type of labels as that image; they differ locally, while their overall distributions are similar.
The 'DataAugmentation' folder contains the code of ML-WGAN, which is similar to the paper "Improved Training of Wasserstein GANs". Because one original image has to be used as the real data distribution, I trained the network for each image individually and generated 3 more images for every original image, which increased the number of training images to 18,000.

An example of the generated images: example

Figure: ML-WGAN

Figure: Generator

Figure: Critic (discriminator)

Convolutional models

Various convolutional models have been used in diverse tasks; I chose ResNet, ResNeXt, Xception, and TResNet, which have shown good results in recent studies. All of these models are pre-trained, so there is no need to initialize their weights from scratch. Comparing the results obtained from these four models shows which one is the most effective.

The images below show the structure of these models: info

| model | number of trainable parameters |
| --- | --- |
| Xception | 21,339,692 |
| ResNeXt50 | 23,512,644 |
| TResNet-m | 29,872,772 |
| ResNet101 | 43,032,900 |

Evaluation Metrics

Precision, Recall, and F1-score are the most popular metrics for evaluating CNN models in image annotation tasks. I've used per-class (per-label) and per-image (overall) precision, recall, and f1-score, which are common in image annotation papers.

The aforementioned evaluation metrics formulas can be seen below: evaluation-metrics

Another evaluation metric used for datasets with large numbers of tags is N+: N-plus

Note that the per-class measures treat all classes equally regardless of their sample size, so one can obtain high performance by focusing on getting rare classes right. To compensate for this, I also measure overall precision/recall, which treats all samples equally regardless of their classes.
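The difference between the two families of metrics can be sketched in plain Python. This is a minimal illustration, not the repository's evaluation code. It assumes that per-class metrics are averaged over all labels, that the overall (per-image) metrics pool the counts over every image and label, that F1 is the harmonic mean of the reported precision and recall, and that N+ counts labels with non-zero recall. The function names and the zero-division convention are assumptions.

```python
def _safe_div(num, den):
    # Convention assumed here: an undefined ratio counts as 0.
    return num / den if den else 0.0


def _f1(precision, recall):
    return _safe_div(2 * precision * recall, precision + recall)


def per_class_metrics(y_true, y_pred):
    """Average precision/recall over labels; also return N+."""
    n_labels = len(y_true[0])
    precisions, recalls = [], []
    for c in range(n_labels):
        tp = sum(t[c] * p[c] for t, p in zip(y_true, y_pred))
        precisions.append(_safe_div(tp, sum(p[c] for p in y_pred)))
        recalls.append(_safe_div(tp, sum(t[c] for t in y_true)))
    precision = sum(precisions) / n_labels
    recall = sum(recalls) / n_labels
    n_plus = sum(1 for r in recalls if r > 0)  # labels recalled at least once
    return precision, recall, _f1(precision, recall), n_plus


def overall_metrics(y_true, y_pred):
    """Pool true positives over every image and label before dividing."""
    tp = sum(t * p for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps))
    precision = _safe_div(tp, sum(sum(ps) for ps in y_pred))
    recall = _safe_div(tp, sum(sum(ts) for ts in y_true))
    return precision, recall, _f1(precision, recall)


# Three images, three labels (rows are images, columns are labels).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
print(per_class_metrics(y_true, y_pred))
print(overall_metrics(y_true, y_pred))
```

In this toy example the third label is never predicted, which pulls the per-class averages down much more than the overall ones.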

Train and Evaluation

To train models in Spyder IDE use the code below:

run main.py --model {select model} --loss-function {select loss function}

Please note that:

  1. You should put ResNet101, ResNeXt50, Xception or TResNet in {select model}.

  2. You should put BCELoss, FocalLoss, AsymmetricLoss or LSEPLoss in {select loss function}.

Using augmented data, you can train models as follows:

run main.py --model {select model} --loss-function {select loss function} --augmentation

To evaluate the model in Spyder IDE use the code below:

run main.py --model {select model} --loss-function {select loss function} --evaluate
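For readers who want to see how flags like these could be parsed, here is a hypothetical sketch using argparse. It is not the repository's main.py. The model and loss names are the ones listed in the notes above; treating --augmentation and --evaluate as boolean switches is an assumption.

```python
import argparse

# Allowed values, taken from the notes above.
MODELS = ["ResNet101", "ResNeXt50", "Xception", "TResNet"]
LOSSES = ["BCELoss", "FocalLoss", "AsymmetricLoss", "LSEPLoss"]


def build_parser():
    """Hypothetical parser mirroring the command-line flags shown above."""
    parser = argparse.ArgumentParser(description="Image annotation on Corel-5k")
    parser.add_argument("--model", choices=MODELS, required=True)
    parser.add_argument("--loss-function", choices=LOSSES, required=True)
    # Assumed to be on/off switches that default to False.
    parser.add_argument("--augmentation", action="store_true")
    parser.add_argument("--evaluate", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--model", "TResNet", "--loss-function", "AsymmetricLoss", "--augmentation"]
)
print(args.model, args.loss_function, args.augmentation, args.evaluate)
```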

Loss Functions & Thresholding

I've used several loss functions and thresholding methods and compared their results on the models mentioned above. Multi-label classification is typically converted into multiple binary classifications, one per label. The model predicts the logit 𝑥_𝑖 of the 𝑖-th label independently, and the probability is obtained by normalizing the logit with the sigmoid function: 𝑝_𝑖 = 𝜎(𝑥_𝑖). Let 𝑦_𝑖 denote the ground truth for the 𝑖-th label. (Logits are the not-yet-normalized outputs of a model.)

The binary classification loss is generally shown in the image below: binary-classification-loss

1: binary cross entropy loss + (fixed threshold = 0.5)

The binary cross entropy (BCE) loss function is one of the most popular loss functions in multi-label classification or image annotation, which is defined as follows for the 𝑖-th label: BCE
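As a minimal sketch of this first setup, the per-label BCE is −[𝑦 log 𝑝 + (1 − 𝑦) log(1 − 𝑝)] with 𝑝 = 𝜎(𝑥), and a label is predicted when its probability reaches the fixed threshold. The code below uses plain Python instead of PyTorch to keep the arithmetic visible; the function names and the choice of `>=` at the threshold are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logit, target):
    """Binary cross entropy for one label, computed from its logit."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def predict(logits, threshold=0.5):
    """Turn per-label logits into 0/1 predictions with a fixed threshold."""
    return [1 if sigmoid(x) >= threshold else 0 for x in logits]


logits = [2.0, -1.0, 0.3]
targets = [1, 0, 0]
print([round(bce_loss(x, y), 4) for x, y in zip(logits, targets)])
print(predict(logits))
```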

results

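| best model | global pooling | batch size | num of training images | image size | epoch time |
| --- | --- | --- | --- | --- | --- |
| TResNet-m | avg | 32 | 4500 | 448 * 448 | 135s |

| data | precision | recall | f1-score |
| --- | --- | --- | --- |
| testset per-image metrics | 0.726 | 0.589 | 0.650 |
| testset per-class metrics | 0.453 | 0.385 | 0.416 |

| data | N+ |
| --- | --- |
| testset | 147 |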

threshold optimization with matthews correlation coefficient

Once training is complete, the parameters of the convolutional network are fixed. The MCC is then calculated separately for each label on the training data at each of these thresholds: 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7. Finally, for each label, the threshold that gives the best MCC is selected.

The following picture illustrates the MCC formula: MCC

The Matthews correlation coefficient measures the correlation between the actual and predicted labels and takes values between -1 and 1. It only produces a high score if the model does well on all four confusion matrix components (true positives, false positives, true negatives, and false negatives), which makes it robust to class imbalance.
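The threshold search for a single label can be sketched as follows. This is an illustration under assumptions, not the repository's code: MCC is taken as 0 when its denominator is zero, ties keep the lowest threshold, and the function names are hypothetical. In the full procedure the same search would be repeated for every label.

```python
import math

# Candidate thresholds 0.05, 0.10, ..., 0.70 as listed above.
THRESHOLDS = [round(0.05 * k, 2) for k in range(1, 15)]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for one label (0/1 lists)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(y_true, probs, thresholds=THRESHOLDS):
    """Return (threshold, mcc) with the highest MCC for one label."""
    best_t, best_score = thresholds[0], -2.0  # MCC is never below -1
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        score = mcc(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# A rare label whose positives only reach probabilities around 0.2-0.3.
y_true = [1, 1, 0, 0, 0, 0]
probs = [0.32, 0.22, 0.12, 0.08, 0.03, 0.18]
print(best_threshold(y_true, probs))
```

With a fixed threshold of 0.5 this label would never be predicted, which is the kind of case where a per-label threshold can raise recall and N+.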

results

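| best model | global pooling | batch size | num of training images | image size | epoch time |
| --- | --- | --- | --- | --- | --- |
| TResNet-m | avg | 32 | 4500 | 448 * 448 | 135s |

| data | precision | recall | f1-score |
| --- | --- | --- | --- |
| testset per-image metrics | 0.726 | 0.589 | 0.650 |
| testset per-class metrics | 0.453 | 0.385 | 0.416 |
| testset per-class metrics + MCC | 0.445 | 0.451 | 0.448 |

| data | N+ |
| --- | --- |
| testset | 147 |
| testset + MCC | 164 |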

2: focal loss

The difference between focal loss and BCE loss is that focal loss makes it easier for the model to predict a label without being 80-100% sure that the label is present. In simple words, it gives the model a bit more freedom to take some risks when making predictions, which is particularly important for highly imbalanced datasets.
BCE loss can lead to overconfidence in the convolutional model, which makes it harder for the model to generalize: the BCE loss is only low when the model is almost certain (more than 80% or 90%) about the presence or absence of a label. With focal loss, as seen in the following picture, the loss is already much lower than BCE when the model predicts a probability of 60% or 70% for a positive label.

focalloss-pos

The focal loss formula for the 𝑖-th label is shown in the image below: focalloss

Focal loss is used to reduce the impact of easy negatives on multi-label training. However, setting a high 𝛾 may eliminate the gradients from rare positive labels. As a result, we cannot expect a higher recall if we increase the value of 𝛾. I will elaborate on this issue in the Gradient Analysis section.
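To make the comparison with BCE concrete, the sketch below evaluates both losses on probabilities directly. It assumes the common form of focal loss without a class-balancing weight, −(1 − 𝑝)^𝛾 log 𝑝 for positives and −𝑝^𝛾 log(1 − 𝑝) for negatives, and uses 𝛾 = 3 as in the results below. The function names are hypothetical.

```python
import math


def bce_loss(p, target):
    """Binary cross entropy for one label, given its probability p."""
    return -math.log(p) if target == 1 else -math.log(1 - p)


def focal_loss(p, target, gamma=3.0):
    """Focal loss for one label (form without a class-balancing weight)."""
    if target == 1:
        return -((1 - p) ** gamma) * math.log(p)
    return -(p ** gamma) * math.log(1 - p)


# Positive label predicted with moderate confidence.
for p in (0.6, 0.7, 0.9):
    print(p, round(bce_loss(p, 1), 4), round(focal_loss(p, 1), 4))

# Easy negative: focal loss is close to zero, BCE is not.
print(round(bce_loss(0.1, 0), 4), round(focal_loss(0.1, 0), 6))
```

Setting `gamma=0` recovers BCE, which is a quick sanity check on the implementation.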

results

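| best model | global pooling | batch size | num of training images | image size | epoch time | 𝛾 |
| --- | --- | --- | --- | --- | --- | --- |
| TResNet-m | avg | 32 | 4500 | 448 * 448 | 135s | 3 |

| data | precision | recall | f1-score |
| --- | --- | --- | --- |
| testset per-image metrics | 0.758 | 0.581 | 0.658 |
| testset per-class metrics | 0.452 | 0.366 | 0.405 |
| testset per-class metrics + MCC | 0.483 | 0.451 | 0.466 |

| data | N+ |
| --- | --- |
| testset | 139 |
| testset + MCC | 162 |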

3: asymmetric loss

As mentioned earlier, the distribution of labels in Corel-5k and other annotation datasets is extremely unbalanced. The training set might contain labels that appear only once, as well as labels that appear more than 1,000 times. Unfortunately, this is inherent to annotation datasets, and little can be done about it.
There is, however, another imbalance: the number of positive versus negative labels in a picture. In simple words, most multi-label pictures contain far fewer positive labels than negative ones (for example, each image in the Corel-5k dataset contains on average 3.4 positive labels).

imbalance labels

In training, this imbalance between positive and negative labels dominates the optimization process, which results in a weak emphasis on gradients from positive labels (more information in the Gradient Analysis section). Asymmetric loss operates differently on positive and negative labels. It has two main parts:
1. Asymmetric Focusing
Unlike focal loss, which uses a single 𝛾 for positive and negative labels, asymmetric loss decouples them by taking 𝛾+ as the focusing level for positive labels and 𝛾− as the focusing level for negative labels. Because the aim is to emphasize the contribution of positive labels, we usually set 𝛾− > 𝛾+.
2. Asymmetric Probability Shifting
Asymmetric focusing reduces the contribution of negative labels to the loss when their probability is low (soft thresholding). However, this attenuation is not always sufficient because of the high level of imbalance in multi-label classification. Therefore, we can use another asymmetric mechanism, probability shifting, which performs hard thresholding on very low probability negative labels and discards them completely. The shifted probability is defined as 𝑝_𝑚 = max⁡(𝑝 − 𝑚, 0), where the probability margin 𝑚 ≥ 0 is a tunable hyperparameter.

In the image below, the asymmetric loss formula for the 𝑖-th label can be seen: asymmetricloss
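The two mechanisms can be combined in a few lines. This sketch assumes the per-label form −(1 − 𝑝)^𝛾+ log 𝑝 for positives and −(𝑝_𝑚)^𝛾− log(1 − 𝑝_𝑚) for negatives, with 𝑝_𝑚 = max(𝑝 − 𝑚, 0) as defined above. The default hyperparameters are the ones reported in the results below; the function name is hypothetical.

```python
import math


def asymmetric_loss(p, target, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    """Asymmetric loss for one label, given its probability p."""
    if target == 1:
        # Positive branch: focusing with gamma_pos only, no shifting.
        return -((1 - p) ** gamma_pos) * math.log(p)
    # Negative branch: shift the probability, then focus with gamma_neg.
    p_m = max(p - margin, 0.0)
    return -(p_m ** gamma_neg) * math.log(1 - p_m)


# Negatives below the margin are discarded entirely (hard thresholding).
print(asymmetric_loss(0.03, 0))
# Negatives slightly above the margin are strongly down-weighted.
print(asymmetric_loss(0.20, 0))
# With gamma_pos = 0, positives keep their full BCE loss.
print(asymmetric_loss(0.60, 1), -math.log(0.60))
```

With `gamma_pos=0`, `gamma_neg=0` and `margin=0` the function reduces to BCE, so the three hyperparameters can be read as progressively stronger departures from the symmetric baseline.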

results

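| best model | global pooling | batch size | num of training images | image size | epoch time | 𝛾+ | 𝛾− | m |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TResNet-m | avg | 32 | 4500 | 448 * 448 | 141s | 0 | 4 | 0.05 |

| data | precision | recall | f1-score |
| --- | --- | --- | --- |
| testset per-image metrics | 0.624 | 0.688 | 0.654 |
| testset per-class metrics | 0.480 | 0.522 | 0.500 |
| testset per-class metrics + MCC | 0.473 | 0.535 | 0.502 |

| data | N+ |
| --- | --- |
| testset | 179 |
| testset + MCC | 184 |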

4: log-sum-exponential-pairwise loss

LSEP loss was proposed in the paper of Y. Li, et al. as an improvement on the simple pairwise ranking loss function. LSEP is differentiable and smooth everywhere, which makes it easier to optimize.

LSEP
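For a single image, LSEP can be written as log(1 + Σ exp(𝑓_𝑣 − 𝑓_𝑢)), where the sum runs over every pair of a negative label 𝑣 and a positive label 𝑢, and 𝑓 denotes the model scores. The sketch below computes it over all pairs in plain Python; the function name is hypothetical, and any pair sampling used for large vocabularies is left out.

```python
import math


def lsep_loss(scores, targets):
    """Log-sum-exp pairwise loss for one image.

    scores  : per-label model scores f
    targets : per-label ground truth (0/1)
    """
    pos = [s for s, t in zip(scores, targets) if t == 1]
    neg = [s for s, t in zip(scores, targets) if t == 0]
    # Each (negative, positive) pair is penalized when the negative
    # label scores close to or above the positive one.
    total = sum(math.exp(v - u) for v in neg for u in pos)
    return math.log1p(total)


targets = [1, 0, 0, 1]
well_ranked = [3.0, -1.0, -2.0, 2.5]  # positives score above negatives
mis_ranked = [-1.0, 3.0, 2.5, -2.0]   # negatives score above positives
print(round(lsep_loss(well_ranked, targets), 4))
print(round(lsep_loss(mis_ranked, targets), 4))
```

Because the loss only depends on score differences, it ranks labels rather than calibrating probabilities, which is why no fixed 0.5 threshold applies directly to its outputs.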

results

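| best model | global pooling | batch size | num of training images | image size | epoch time |
| --- | --- | --- | --- | --- | --- |
| ResNeXt50 | avg | 32 | 4500 | 224 * 224 | 45s |

| data | precision | recall | f1-score |
| --- | --- | --- | --- |
| testset per-image metrics | 0.490 | 0.720 | 0.583 |
| testset per-class metrics | 0.403 | 0.548 | 0.464 |

| data | N+ |
| --- | --- |
| testset | 188 |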

The result of the trained model with LSEP loss on one batch of test data: lsep_results

Gradient Analysis

The goal of this section is to analyze and compare the gradients of the different losses in order to better understand their properties and behavior. Since the weights of the network are updated according to the gradient of the loss function with respect to the input logit 𝑥, it is useful to look at these gradients.
The BCE loss gradients for positive and negative labels are as follows:

BCE

As mentioned before, the massive imbalance between positive and negative labels in an image affects the optimization process. Because of its symmetrical nature (every positive and negative label contributes equally), BCE cannot overcome this problem. In the image above, the red line indicates that BCE loss always has a gradient greater than zero, even for negative labels whose probability (p) is close to zero. The high number of negative labels in annotation problems causes the gradients of positive labels to be underemphasized during training.
To resolve this issue, we can either reduce the contribution of negative labels to the loss or increase the contribution of positive labels to the loss.

Another symmetric loss function, one that tries to down-weight the contribution from easy negatives, is focal loss. Its gradients for positive and negative labels are shown in the picture below:

focal

According to the image above, focal loss successfully reduces the contribution from easy negatives (the loss gradients for low-probability negatives are near zero). But because of its symmetrical nature, focal loss also eliminates the gradients from the rare positive labels.
The loss gradient for positive labels indicates that it only pushes a very small proportion of hard positives to a high probability and ignores a large share of semi-hard ones with a medium probability.
Increasing 𝛾 decreases the contribution of easy negative labels further, but it also makes the gradients of more positive labels disappear.
Asymmetric loss is one of the best solutions to the problems associated with BCE loss and focal loss. The reason why it has a significant effect on the result can be seen in the illustration of loss gradients for positive and negative labels below:

AS

One of our objectives was to reduce the contribution of negative labels to the loss, but symmetric loss functions such as focal loss cannot preserve the contribution of positives while reducing the contribution of negatives. By choosing 𝛾− and 𝛾+ differently, however, this objective can be achieved, as shown in the image above. Furthermore, the loss gradients for negative labels indicate that hard thresholding (m) not only causes very low probability negative labels to be ignored completely but also affects the very hard ones, which are considered missing labels. A missing label can be defined as follows: if the network assigns a very high probability to a negative label, that label might actually be positive.
The loss gradients of negative labels with a large probability (p > 0.9) are found to be very low, indicating that they can be accepted as missing labels.
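The gradient behavior discussed in this section can be checked numerically. The sketch below estimates the derivative of each loss with respect to the logit 𝑥 by central differences, using 𝛾 = 3 for focal loss and 𝛾+ = 0, 𝛾− = 4, m = 0.05 for asymmetric loss as in the result tables. It is an illustration in plain Python with hypothetical function names, not the code used to draw the figures.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce(p, y):
    return -math.log(p) if y == 1 else -math.log(1 - p)


def focal(p, y, gamma=3.0):
    if y == 1:
        return -((1 - p) ** gamma) * math.log(p)
    return -(p ** gamma) * math.log(1 - p)


def asl(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    if y == 1:
        return -((1 - p) ** gamma_pos) * math.log(p)
    p_m = max(p - margin, 0.0)
    return -(p_m ** gamma_neg) * math.log(1 - p_m)


def grad_wrt_logit(loss_fn, p, y, eps=1e-6):
    """Central-difference estimate of dL/dx at the logit x with sigmoid(x) = p."""
    x = math.log(p / (1 - p))
    return (loss_fn(sigmoid(x + eps), y) - loss_fn(sigmoid(x - eps), y)) / (2 * eps)


print("easy negative (p = 0.02):")
for name, fn in (("BCE", bce), ("focal", focal), ("ASL", asl)):
    print(f"  {name:5s} {grad_wrt_logit(fn, 0.02, 0):.6f}")

print("semi-hard positive (p = 0.5), gradient magnitude:")
for name, fn in (("BCE", bce), ("focal", focal), ("ASL", asl)):
    print(f"  {name:5s} {abs(grad_wrt_logit(fn, 0.5, 1)):.6f}")
```

For BCE the estimate matches the closed form 𝑝 − 𝑦. With these settings, focal loss shrinks the gradient of the semi-hard positive along with the easy negative, while asymmetric loss zeroes the easy negative and leaves the positive gradient at its BCE value.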

Conclusions

To sum up, different convolutional models and loss functions gave different results. Among the convolutional models, TResNet performed better than the others, not only in its results but also in memory usage. Based on the results, LSEP loss leads to a higher recall, so the model predicts more labels per image. BCE and focal loss, in contrast, lead to a higher precision, so the model is more cautious and tends to predict only the more probable labels. The best results in both f1-score and N+ were obtained by the model optimized with the asymmetric loss function, which shows the advantage of this loss function over the others on this dataset.
To compare the results, I ran many experiments, including changing the resolution of the images (from 224 * 224 to 448 * 448) and changing the global pooling of the convolutional models (from global average pooling to global maximum pooling). Among these experiments, the results reported above are the best.
Unfortunately, the data augmentation method (ML-WGAN) did not produce the expected results and did not reduce the overfitting problem.
In this project, I used a variety of CNNs and loss functions without taking label correlations into account. Methods that consider the semantic relationships between labels, such as graph convolutional networks (GCNs) or recurrent neural networks (RNNs), may give better results.

References

Y. Li, Y. Song, and J. Luo.
"Improving Pairwise Ranking for Multi-label Image Classification" (CVPR - 2017)

T. Ridnik, E. Ben-Baruch, N. Zamir, A. Noy, I. Friedman, M. Protter, and L. Zelnik-Manor.
"Asymmetric Loss For Multi-Label Classification" (ICCV - 2021)

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville.
"Improved Training of Wasserstein GANs" (arXiv - 2017)