The goal of knowledge distillation is to improve the performance of a smaller model, which typically has fewer parameters, by allowing it to learn from a more competent model, the teacher. The smaller model, or student, extracts knowledge from the teacher by matching its class distribution to the teacher's. To soften the distributions used in the loss function during training, we apply a temperature T, dividing the logits by T before the softmax. This project designates EfficientNet-B0 as the teacher and SqueezeNet v1.1 as the student, and both are trained and evaluated on the DermaMNIST dataset from MedMNIST. The results section compares the performance of the teacher, the student trained without knowledge distillation, and the student trained with knowledge distillation. A sketch of the temperature-scaled distillation loss is shown below.
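The following is a minimal sketch of the temperature-scaled distillation loss described above, written in PyTorch. The function name, the temperature, and the weighting factor `alpha` are illustrative assumptions, not the exact settings used in this project.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Combine a soft-target (KL) loss with a hard-target (CE) loss.

    `temperature` and `alpha` are placeholder values for illustration.
    """
    # Soften both distributions by dividing the logits by the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)

    # KL divergence between the softened distributions, scaled by T^2 so the
    # gradient magnitude stays comparable across temperatures.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth class indices.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In a typical training loop, the teacher's logits would be computed under `torch.no_grad()` so that only the student's parameters are updated by this loss.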
To see the distillation in action, please refer to the notebook at the following link.
The quantitative results are summarized in the table below.
| Model | Loss | Accuracy |
|---|---|---|
| Teacher | 1.935 | 71.61% |
| Student | 1.932 | 69.02% |
| Distilled | 1.918 | 73.44% |
Loss curves of the teacher model on the training and validation sets.
Accuracy curves of the teacher model on the training and validation sets.
Loss curves of the student model on the training and validation sets.
Accuracy curves of the student model on the training and validation sets.
Loss curves of the distilled model on the training and validation sets.
Accuracy curves of the distilled model on the training and validation sets.
Comparison of the validation loss curves of the teacher, student, and distilled models.
Comparison of the validation accuracy curves of the teacher, student, and distilled models.
The qualitative results of the models on the test set are shown below.
The qualitative results of the teacher model.