Modified Knowledge Transfer Technique

Each teacher layer transfers knowledge, in the form of its representations, to one specific student layer only. This knowledge affects the target layer and no other layer of the student network. As a result, the student can learn more effectively than with traditional knowledge transfer techniques. In a neural network, every layer extracts specific features of the input image, and different layers of the student and teacher networks extract different features. Hence, the features from a given teacher layer cannot be applied uniformly to all layers of the student network.

As an example, assume the first layer of both the teacher and the student network extracts edges of the input image, and the second layer extracts shapes such as circles. Transferring knowledge from the first layer of the teacher to the second layer of the student is then not beneficial, because the two layers represent different features. A random mapping of teacher layers to student layers may occasionally work well, but not reliably. I propose to use the cosine similarity metric to find the mapping between student and teacher layers. For each student-teacher layer pair, the outputs of the two layers are normalized and the dot product of the normalized outputs is calculated. The higher the value of the dot product, the greater the similarity between the two layers.
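The mapping step described above can be sketched roughly as follows. This is a minimal illustration and not the code in vgg16_main.py: the function names `cosine_similarity` and `map_layers`, the layer names, and the toy vectors are all hypothetical, and it assumes each layer output has already been flattened to a vector of the same length for the student and the teacher.

```python
import math


def cosine_similarity(a, b):
    """Dot product of the two L2-normalized vectors a and b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero output has no direction; treat it as dissimilar.
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def map_layers(student_outputs, teacher_outputs):
    """For each student layer, pick the teacher layer with the highest
    cosine similarity (hypothetical helper, assumes equal-length vectors)."""
    mapping = {}
    for s_name, s_out in student_outputs.items():
        mapping[s_name] = max(
            teacher_outputs,
            key=lambda t_name: cosine_similarity(s_out, teacher_outputs[t_name]),
        )
    return mapping


# Toy flattened layer outputs (made-up values, for illustration only).
teacher = {"t1": [1.0, 0.0, 0.0], "t2": [0.0, 1.0, 1.0]}
student = {"s1": [0.9, 0.1, 0.0], "s2": [0.1, 0.8, 0.9]}

mapping = map_layers(student, teacher)
print(mapping)
```

With real networks the layer outputs would be feature maps collected on the same input batch, and some reshaping or projection would be needed when the student and teacher layers differ in size; that step is left out of this sketch.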

Command to Execute Independent Student:

python vgg16_main.py --student True --dataset caltech101 --learning_rate 0.0001

python vgg16_main.py --student True --dataset cifar10 --learning_rate 0.0001

Command to Execute Dependent Student:

python vgg16_main.py \
    --dependent_student True \
    --train_dataset caltech101-train.txt \
    --test_dataset caltech101-test \
    --num_training_examples 5853 \
    --num_testing_examples 1829 \
    --NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN 5853 \
    --num_classes 102 \
    --image_width 224 \
    --image_height 224 \
    --batch_size 45 \
    --learning_rate 0.005 \
    --dataset caltech101

python vgg16_main.py \
    --dependent_student True \
    --train_dataset cifar10-train.txt \
    --test_dataset cifar10-test \
    --num_training_examples 45000 \
    --num_testing_examples 10000 \
    --NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN 45000 \
    --num_classes 10 \
    --image_width 32 \
    --image_height 32 \
    --batch_size 128 \
    --learning_rate 0.01 \
    --top_1_accuracy True \
    --dataset cifar10

Hyperparameters

- Independent Student: batch size 45; learning rate 0.0001

Experiments

Train VGG16 Independent Student on Caltech101
Train VGG16 Dependent Student on Caltech101 with new method

Pre-Trained Weights