Distillation from small networks

Target: show that a large network, which on its own overfits the training set, can outperform the small network used as its distillation teacher.

Experiments:

  • large network w/o distillation
  • large network w/ aggressive augmentation
  • large network w/ distillation
  • large network w/ distillation and aggressive augmentation
| setting            | augment | distill | test top-1 acc (%) | test top-5 acc (%) | train top-1 acc (%) | train top-5 acc (%) |
|--------------------|---------|---------|--------------------|--------------------|---------------------|---------------------|
| resnet_original_20 |         |         | 67.36              | 90.96              | 88.29               | 98.77               |
| resnet18           |         |         | 57.34              | 80.38              | 99.96               | 100.00              |
| resnet18           | x       |         | 55.01              | 77.69              | 99.97               | 100.00              |
| resnet18           |         | x       |                    |                    |                     |                     |
| resnet18           | x       | x       |                    |                    |                     |                     |
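
For reference, a minimal sketch of the soft-target distillation objective such a setup would typically use (Hinton-style knowledge distillation), with the small ResNet-20 as a frozen teacher and ResNet-18 as the student. PyTorch is assumed, and the temperature `T` and mixing weight `alpha` are illustrative defaults, not values recorded in these notes.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft targets from the frozen small teacher, softened by temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # KL term scaled by T^2 so its gradients stay comparable to the CE term.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

def train_step(student, teacher, optimizer, images, labels):
    # Teacher (small net) stays frozen; only the student (large net) is updated.
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(images)
    student_logits = student(images)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```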
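
The notes do not record what "aggressive augmentation" means concretely. Purely as an illustration, assuming 32×32 CIFAR-style inputs (the accuracies look like a 100-class CIFAR-scale setup, but that is an assumption) and torchvision, an aggressive pipeline might combine RandAugment with random crops, flips, and random erasing:

```python
from torchvision import transforms

# Hypothetical "aggressive" pipeline for 32x32 inputs; the actual policy used
# in these experiments is not recorded in the notes.
aggressive_train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),   # strong randomized policies
    transforms.ToTensor(),
    transforms.Normalize((0.5071, 0.4865, 0.4409),    # CIFAR-100 channel statistics
                         (0.2673, 0.2564, 0.2762)),
    transforms.RandomErasing(p=0.5),                  # cutout-style occlusion
])
```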