GeoffreyChen777/VC

About the classification loss calculation

Closed this issue · 5 comments

Hi, thanks for your great work, but I am a little confused about the calculation of the classification loss in your code. From my understanding, the logits sent into the focal loss have the shape [batched_roi_bbox_num, num_classes + 1 + batched_roi_bbox_num]. In the last dimension, indices 0 to num_classes - 1 correspond to the original classification labels, index num_classes is the background class, and indices num_classes + 1 to num_classes + batched_roi_bbox_num represent the similarity scores of the batched RoI boxes from a different view, where a label of -1 means ignore.

So, could you tell me which activation function you use in the focal loss, softmax or sigmoid? Since there may be multiple "1"s in the last dimension, how do you deal with that situation? Also, what is the meaning of the CE calculation in the loss function in your code, and why do we add 1 to "pos_term * neg_term"?
Hope to get a more detailed explanation from you. Thanks!
[screenshot of the classification loss code]

Hi,

The size of the last dimension is num_classes + 1 + batched_roi_bbox_num. It is just for convenience of implementation.

Let me give you an example to explain the details of this.

Suppose we have logits for 4 predefined categories and 4 RoI boxes.

Since we use torch.mm to get the virtual logits here:

noise_resistant_logits = torch.mm(self.box_feats, F.normalize(linear_feats).t()) / self.temprature

The logits for one of the roi boxes (e.g., No.2) can be: [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7].
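To make the shape concrete, here is a minimal sketch of how those extra columns arise (the names and sizes below are my own assumptions for illustration, not the exact repo code):

import torch
import torch.nn.functional as F

# Toy shapes: 4 RoI boxes, 8-dim features, 4 predefined categories (plus background).
N, D, num_classes, temperature = 4, 8, 4, 0.1

box_feats = torch.randn(N, D)                 # student RoI features
linear_feats = torch.randn(N, D)              # teacher features, used as "virtual weights"
cls_logits = torch.randn(N, num_classes + 1)  # ordinary classifier logits (incl. background)

# Similarity of every RoI feature to every virtual weight -> [N, N]
virtual_logits = torch.mm(box_feats, F.normalize(linear_feats).t()) / temperature

# Final logits fed to the loss: [N, num_classes + 1 + N]
full_logits = torch.cat((cls_logits, virtual_logits), dim=1)
print(full_logits.shape)  # torch.Size([4, 9])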

If this RoI is a confusing one, we need to use the virtual weight to train it, so here we construct the additional label matrix for all the boxes:

additional_probs = torch.zeros(
    (additional_logits_idxs.shape[0], self.cls_logits.shape[1] - self.num_classes - 1), device=self.device
).fill_(-1)
flag = additional_logits_idxs >= 0
additional_probs[flag].scatter_(dim=1, index=additional_logits_idxs[flag].unsqueeze(1), value=1)
cls_probs = torch.cat((cls_probs, additional_probs), dim=1)

We first fill this additional matrix additional_probs with -1. After that, for all the confusing boxes, we put a 1 at their corresponding index. For example, for the No.2 box in the above-mentioned example, the additional matrix should be

[
    ...,
    [-1, 1, -1, -1],
    ...
]
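As a toy illustration (my own simplified version, not the repo code verbatim), assuming additional_logits_idxs holds each confusing box's own index and -1 for unambiguous boxes:

import torch

# 4 RoI boxes -> 4 virtual columns; only box No.2 (row index 1) is confusing.
additional_logits_idxs = torch.tensor([-1, 1, -1, -1])

additional_probs = torch.full((4, 4), -1.0)                 # start with "ignore" everywhere
flag = additional_logits_idxs >= 0
rows = flag.nonzero(as_tuple=True)[0]
additional_probs[rows, additional_logits_idxs[flag]] = 1.0  # put a single 1 per confusing box

print(additional_probs)
# tensor([[-1., -1., -1., -1.],
#         [-1.,  1., -1., -1.],
#         [-1., -1., -1., -1.],
#         [-1., -1., -1., -1.]])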

Then we concatenate it with the original one-hot labels. Notably, there will never be more than one 1 in each row. The reason is that, for the confusing boxes, we ignore the categories in the PC set here:

nagree_flag = torch.ne(label[:, 4], potential_label[:, 4])
prob_matrix = torch.zeros((label.shape[0], self.num_classes + 1), device=self.device)
prob_matrix.scatter_(1, label[:, 4].long().unsqueeze(1), 1)
prob_matrix[nagree_flag] = prob_matrix[nagree_flag].scatter_(
    1, label[nagree_flag, 4].long().unsqueeze(1), -1
)
prob_matrix[nagree_flag] = prob_matrix[nagree_flag].scatter_(
    1, potential_label[nagree_flag, 4].long().unsqueeze(1), -1
)
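To show what that produces, here is a small toy run (my own example, assuming each label row is [x1, y1, x2, y2, class_id], with a 4-column real part so it lines up with the example rows below):

import torch

num_classes = 3  # real part of the label then has num_classes + 1 = 4 columns
# Dummy box coordinates; only column 4 (the class id) matters here.
label = torch.tensor([[0., 0., 0., 0., 0.],             # box 0: student pseudo label = class 0
                      [0., 0., 0., 0., 1.]])            # box 1 (No.2): student pseudo label = class 1
potential_label = torch.tensor([[0., 0., 0., 0., 0.],   # box 0: teacher agrees -> unambiguous
                                [0., 0., 0., 0., 3.]])  # box 1: teacher says class 3 -> confusing

nagree_flag = torch.ne(label[:, 4], potential_label[:, 4])
prob_matrix = torch.zeros((label.shape[0], num_classes + 1))
prob_matrix.scatter_(1, label[:, 4].long().unsqueeze(1), 1)
prob_matrix[nagree_flag] = prob_matrix[nagree_flag].scatter_(
    1, label[nagree_flag, 4].long().unsqueeze(1), -1
)
prob_matrix[nagree_flag] = prob_matrix[nagree_flag].scatter_(
    1, potential_label[nagree_flag, 4].long().unsqueeze(1), -1
)

print(prob_matrix)
# tensor([[ 1.,  0.,  0.,  0.],    # unambiguous box: normal one-hot
#         [ 0., -1.,  0., -1.]])   # confusing box: both PC-set classes are ignored

Concatenating the confusing row with its virtual part from above ([-1, 1, -1, -1]) gives exactly [0, -1, 0, -1, -1, 1, -1, -1].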

Thus, for each row of the final one-hot label, if it belongs to a confusing box, the label takes a form like [0, -1, 0, -1, -1, 1, -1, -1]: there is only one 1, some -1s, and 0s for the remaining categories.
If it belongs to an unambiguous box, the label looks like [1, 0, 0, 0, -1, -1, -1, -1]: all the virtual indices are ignored.

Then, when we calculate the loss, we actually use the softmax to get the probability, but you cannot find an explicit softmax function. The reason is that we use an equivalent form of the softmax cross-entropy, that is:

exp_x = x.exp()
pos_term = (1 / exp_x * target * mask).sum(dim=1)
neg_term = (exp_x * torch.eq(target, 0).float() * mask).sum(dim=1)
CE = (1 + pos_term * neg_term).log()
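For reference, the algebra behind that form (my own derivation, assuming a single positive index t and that mask zeroes out the ignored entries) is:

\mathrm{CE} = -\log\frac{e^{x_t}}{\sum_j e^{x_j}} = \log\Bigl(1 + e^{-x_t}\sum_{j \neq t} e^{x_j}\Bigr) = \log\bigl(1 + \mathrm{pos\_term} \cdot \mathrm{neg\_term}\bigr)

where the sum over j ≠ t runs only over the entries whose target is 0. That is also why the 1 appears inside the log: it is the e^{x_t} / e^{x_t} term of the denominator.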

You can find some details here if you can read Chinese:
https://spaces.ac.cn/archives/7359

If you cannot read Chinese, you can tell me.

BTW, the VC learning code in detection is a little bit ugly because detection is a very complex task. We are working on classification and segmentation in our journal version, and we do have a clean version of the VC loss that can be used in classification and segmentation. If you are interested, I can share it with you.

Here is a demo to verify the cross-entropy equivalence:

import torch
import torch.nn.functional as F


x = torch.randn((1, 4))
target = torch.tensor([1])
onehot_target = torch.tensor([[0, 1, 0, 0],])


exp_x = x.exp()
pos_term = (1 / exp_x * onehot_target).sum(dim=1)
neg_term = (exp_x * torch.eq(onehot_target, 0).float()).sum(dim=1)

CE = (1 + pos_term * neg_term).log()

print(CE)


loss = F.cross_entropy(x, target, reduction="none")
print(loss)

python test.py
tensor([1.0274])
tensor([1.0274])

Sorry, I just fixed some errors in my first answer. Please read it again if you are still confused...

Great! Thanks for your explanation! I found that I had forgotten about the operation that ignores the original pseudo label when the pseudo label is ambiguous. BTW, I think that from the optimization point of view, this method is very similar to contrastive learning. When the pseudo label is ambiguous, to avoid wrong optimization, we just keep the features extracted by the teacher and the student as close as possible, am I right? Also, could there be a situation with two overlapping ambiguous pseudo labels? If we keep each corresponding RoI box's features close while pushing away the features of different RoI boxes, it may cause some conflict, since those RoI boxes almost cover the same object.

That's correct, the overall idea is very similar to contrastive learning, but it is still a little bit different.

As we mentioned at the bottom of page 6 in the paper:

Although one may suspect that our approach looks similar to the contrastive learning [15,9] in terms of the optimisation target, they differ in several aspects. Firstly, contrastive learning operates before the task-relevant layer (i.e., the classifier). As a result, it only drives the backbone to extract better features but contributes nothing to the task-relevant layer. While our approach acts after the classifier so that the gradient of virtual category can backpropagate to not only the backbone but also the weight vectors in the classifier. Secondly, the weight vectors of the other categories in the classifier naturally constitute negative samples such that there is no need to maintain a negative sample pool.

Anyway, we are trying to pull the features from the teacher and the student together. That is quite similar to contrastive learning.

About your concern regarding conflicts: for different proposal boxes covering the same object, their virtual weight is exactly the same one:

for data in strong_data_list_u:
    w = data.img.shape[2]
    proposal = data.label.clone()
    proposal[:, [0, 2]] = w - proposal[:, [2, 0]]
    proposals.append({"boxes": proposal})
# 2.2 Extract features
self.ema_detector(self.weak_aug(self.flip(data_list_u)), proposals=proposals)
linear_feat_list = self.linear_feats.split([data.label.shape[0] for data in strong_data_list_u])

We first get the teacher features of each pseudo label. When we train the detection head, we have many candidate proposals. We first do the matching with the pseudo labels, just like Faster R-CNN does. Then we can allocate the corresponding virtual weights (teacher features) to each candidate proposal according to the matching result. For those candidate proposals covering one object, the virtual weights are the same one.
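As a toy sketch of that last point (my own illustration, not the repo code): after matching, each proposal simply indexes into the teacher features, so all proposals matched to the same pseudo box end up with the identical virtual weight:

import torch

num_pseudo_boxes, feat_dim = 2, 8
teacher_feats = torch.randn(num_pseudo_boxes, feat_dim)     # one virtual weight per pseudo label

# Hypothetical matching result (as Faster R-CNN matching would produce):
# proposals 0, 1 and 3 all cover the object behind pseudo box 0.
matched_idx = torch.tensor([0, 0, 1, 0, 1])

virtual_weights = teacher_feats[matched_idx]                # [5, feat_dim]
print(torch.equal(virtual_weights[0], virtual_weights[3]))  # True: the same virtual weight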