Distilling Knowledge from an image classification model with sigmoid function and binary cross entropy
publioelon opened this issue · 3 comments
Hi, I found this paper and GitHub repo, and it looks robust. I was wondering whether it is possible to use your framework to distill knowledge from a cumbersome image classification model that uses a sigmoid function for classification and binary cross entropy for the loss. Since it is a cumbersome model trained on a custom dataset, I would like to know whether I can use your framework to distill the knowledge into a smaller network that uses a softmax classification layer instead, and what steps are required to do so?
Hi @publioelon
Yes, I think you can design the experiments with torchdistill easily.
If you further clarify your settings, I can give you the steps to do that.
- How many classes does your dataset have? Is it a binary-classification task (i.e., 2 classes)?
- Why do you want to use the output from the sigmoid function for computing a loss? Is it because you use a binary cross-entropy module in PyTorch?
- What model architectures do you want to use as teacher and student models? If they are models you designed yourself, tell me the input size (e.g., 224 x 224) and output shape.
Also, please use the Discussions tab above for questions. As explained here, I want to keep Issues mainly for bug reports.
Hello @yoshitomo-matsubara, thank you for replying, and I apologize for not starting a discussion in the right place.
EDIT: Should I close this issue and re-open it as a thread in Discussions?
I have a cumbersome model that does very well (high accuracy) on a fever classification task using thermal images. It was trained with transfer learning from a VGG16 architecture, the input shape is 128x160, and it has two classes: fever and healthy. From the papers and experiments I've seen, knowledge distillation usually computes a KL-divergence loss between two softmax outputs. Due to the limited samples in my dataset, and because I need a softmax classification layer instead of sigmoid without retraining from scratch, I need to rely on knowledge distillation to compress the model so it can run on single-board computers with hardware accelerators.
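For reference, this is my rough understanding of the temperature-scaled KD loss those papers describe (a minimal PyTorch sketch, not torchdistill's actual implementation; the temperature value is just an example):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    # Soften both distributions with the same temperature, then take the KL divergence.
    # Multiplying by T^2 keeps the gradient scale comparable to a hard-label loss.
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * temperature ** 2
```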
Basically, I have a cumbersome TensorFlow .h5 model that uses binary cross entropy (I'm not sure why it uses binary cross entropy), and I want to compress it into a smaller model that uses softmax for classification, so I can run it on an Edge TPU accelerator.
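From what I understand, a single sigmoid output is mathematically equivalent to a 2-way softmax over the logits [0, z], so I'm hoping the teacher's sigmoid probability can be converted into 2-class soft targets for a softmax student without retraining the teacher. A rough sketch of what I mean (PyTorch, all names and values are just illustrative):

```python
import torch
import torch.nn.functional as F

def sigmoid_prob_to_soft_targets(p_fever):
    # p_fever: (batch,) tensor of P(fever) from the sigmoid teacher.
    # Since sigmoid(z) == softmax([0, z])[1], the 2-class soft target is simply [1 - p, p].
    p = p_fever.view(-1, 1)
    return torch.cat([1.0 - p, p], dim=1)

# Hypothetical usage with a 2-logit softmax student:
teacher_probs = torch.tensor([0.92, 0.15, 0.60])    # made-up teacher outputs
student_logits = torch.randn(3, 2)                  # student's 2-class logits
soft_targets = sigmoid_prob_to_soft_targets(teacher_probs)
loss = F.kl_div(F.log_softmax(student_logits, dim=1), soft_targets, reduction='batchmean')
```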
Hi @publioelon
No worries. Can you close this issue and migrate your comment(s) to a discussion?
I think there will be multiple interactions on this, and you will probably have follow-up questions as you run experiments, so Discussions would be more convenient for me.