denguir/student-teacher-anomaly-detection

The loss value of the teacher model is very large

Opened this issue · 6 comments

On the Screw class, the loss does not converge when I run resnet_train.py, and the loss values are particularly large when training the teacher model. Why is that?

I ran into this problem when training the model on the 'bottle' category. Have you found the reason?

I have also run into this problem. Do you know how to deal with it?

I think there is something wrong with the loss functions. The compactness loss should be calculated before the decoder, which is the last (fc) layer of the teacher network. There is a conflict between the KD loss and the compactness loss: the output of the teacher network, the 512-d vector, cannot be both compact and close to the output of the ResNet. If you read the paper, you will find that the KD loss is the distance between D(T(p)) and P(p), while the compactness loss is computed over all T(p) in a minibatch.
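To make the distinction concrete, here is a minimal NumPy sketch of the knowledge-distillation term as described above: the decoder D maps the compact d-dim descriptor T(p) up to the dimensionality of the pretrained network's output P(p), and the loss penalizes the squared distance between them. The function name and shapes are illustrative, not taken from this repository:

```python
import numpy as np

def kd_loss(decoded, pretrained):
    """Knowledge-distillation loss sketch: mean || D(T(p)) - P(p) ||^2.

    decoded:    (n, D) array of decoder outputs D(T(p)), one row per patch p.
    pretrained: (n, D) array of pretrained-network descriptors P(p).
    """
    return np.mean(np.sum((decoded - pretrained) ** 2, axis=1))
```

Note that this term constrains only the decoded output, not the 512-d descriptor itself, which is why the compactness loss has to act on T(p) directly.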

I think much of the Teacher training approach in this implementation is mistaken. First, the authors of the paper pretrained the Teacher network on patches from ImageNet, not on the target dataset. I suspect that by using, for example, the Carpet dataset from MVTec for distilling the knowledge, you could overfit the Teacher descriptors and end up hurting the Student training.
Also, @ganqieniurou is right: the compactness loss should be computed on the feature descriptors from the layer prior to the decoder (1x1x1xd). Otherwise, there is no guarantee that the feature descriptor distributions are being optimized during training.
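As an illustration of computing the compactness loss over a minibatch of penultimate-layer descriptors T(p), here is a minimal NumPy sketch. It uses a decorrelation-style formulation (penalizing the off-diagonal entries of the descriptors' correlation matrix); the function name and exact formulation are assumptions for illustration, not this repository's code:

```python
import numpy as np

def compactness_loss(descriptors):
    """Compactness loss sketch over a minibatch of teacher descriptors.

    descriptors: (n, d) array, one d-dim feature vector T(p) per patch p.
    Returns the summed absolute off-diagonal correlation between descriptor
    dimensions, which is minimized when the dimensions are decorrelated.
    """
    centered = descriptors - descriptors.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=0) + 1e-8   # avoid divide-by-zero
    normalized = centered / norms
    corr = normalized.T @ normalized                  # (d, d) correlation matrix
    off_diag = corr - np.diag(np.diag(corr))          # zero out the diagonal
    return np.abs(off_diag).sum()
```

The key point is that the input here is the (n, d) batch of descriptors from before the decoder, so the loss shapes the descriptor distribution itself rather than the decoder output.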

Hi there,

First, the authors of the paper pretrained the Teacher network on patches from ImageNet, not on the target dataset. I suspect that by using, for example, the Carpet dataset from MVTec for distilling the knowledge, you could overfit the Teacher descriptors and end up hurting the Student training.

@ThiagoViek you are right, this is how it's done in the original paper. I deviated from their implementation because training on the whole of ImageNet would require too much computation on my side. So I decided to speed up training by proposing a Teacher that is specialized in a specific class of objects, rather than designing a Teacher with broader knowledge, so to speak.

Also, @ganqieniurou is right: the compactness loss should be computed on the feature descriptors from the layer prior to the decoder (1x1x1xd). Otherwise, there is no guarantee that the feature descriptor distributions are being optimized during training.

I don't remember the paper in detail, but I understood the compactness loss as a way to promote sparsity for the decoder, which is why I applied it to the last layer. But I could indeed be wrong.

yffbk commented


Can you share the code for how you calculate the compactness loss?