haitongli/knowledge-distillation-pytorch

The train dataloader is shuffled every epoch. Does it really work?

HisiFish opened this issue · 11 comments

In the code, the dataloader's 'shuffle' option is set to True.
So the teacher outputs cannot actually work as intended.

Can you clarify the question a bit more? What is the specific concern?
The student model is trained the same way the teacher model was. In one epoch, the training batches are used to compute the KD loss and train the student. In the next epoch, although the dataloader is shuffled, the KD loss should still be correct given the new batches.

For example, suppose we have a dataset with 20 [image, label] pairs and set the batch size to 4, so there are 5 iterations in each epoch. Mark the original data indices 0~19.

In the code, we first fetch the teacher outputs in one epoch; the shuffled indices might be [[0,5,6,8],[7,9,2,4],[...],[...],[...]].

Then in KD training, in another epoch, we need to calculate the KD loss from the student outputs, the teacher outputs, and the labels. In this new epoch the indices may be shuffled to [[1,3,6,9],[10,2,8,7],[...],[...],[...]]. In train.py:215, we get output_teacher_batch by i, which is the new iteration index. When i is 0, the teacher outputs come from data [0,5,6,8] while the student outputs come from data [1,3,6,9].
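
To make this concrete, here is a minimal, self-contained sketch of the mismatch (the 20-sample dataset and batch size 4 are just the toy numbers from above; teacher_batches stands in for the per-batch teacher-output cache):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 20 samples whose "data" is simply their own index 0..19.
dataset = TensorDataset(torch.arange(20))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# "Teacher epoch": cache outputs keyed by batch position, like the code does.
teacher_batches = [batch.clone() for (batch,) in loader]

# "Student epoch": the loader reshuffles, so batch i holds different samples.
for i, (batch,) in enumerate(loader):
    print(i, "teacher cached:", teacher_batches[i].tolist(),
          "student sees:", batch.tolist())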

I'm not sure whether my understanding is correct. Thanks!

Sorry, I did not fully understand. If you have time and are interested, could you run a test based on your understanding? Right now the KD-trained accuracies are consistently higher than those of the natively trained models, though only by a bit. If your modification works better or makes better sense, feel free to make a pull request. Thanks in advance!

OK, I'll do that if I have a conclusion. Thanks.

Wait, I think I get what you were saying. Basically, we need to verify that during training of the student model at each epoch, the batch sequence in the train dataloader stays the same as what was used during training of the teacher model. To that end, I think PyTorch should be able to take care of that when a random seed is specified for reproducibility?

Maybe not.
It's easy to verify. The following is a simple example:

dataloader = ...   # a DataLoader created with shuffle=True
for epoch in range(10):
    for i, (img_batch, label_batch) in enumerate(dataloader):
        if i == 0:
            print(label_batch)   # labels of the first batch of this epoch

By comparing the first batch of each of the 10 epochs, we can see that the order changes from epoch to epoch.

I think a random seed can only make the behavior the same across different runs,
but it cannot make the behavior the same across different epochs within a single run.
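
A quick sketch of that point (the seed value and toy dataset are just illustrative): with a fixed seed the whole run is reproducible, but consecutive epochs still see different orderings.

import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)   # fixes run-to-run behavior
loader = DataLoader(TensorDataset(torch.arange(20)), batch_size=4, shuffle=True)

for epoch in range(2):
    for (batch,) in loader:
        print("epoch", epoch, "first batch:", batch.tolist())
        break
# The two printed batches differ from each other, yet rerunning the script
# prints exactly the same two batches again.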

@HisiFish Have you solved this problem? Is it possible to compute the teacher output from the same input?
--Updated--
Actually, it helps increase the accuracy by 0.10-0.20%.

Do you know what happens when you don't use enumerate but get batches via next(iter(data_loader))?
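
For reference, a quick sketch of that behavior: with shuffle=True, every call to iter() starts a freshly shuffled pass, so next(iter(data_loader)) repeatedly returns the first batch of a new permutation rather than stepping through one epoch.

import torch
from torch.utils.data import DataLoader, TensorDataset

data_loader = DataLoader(TensorDataset(torch.arange(20)), batch_size=4, shuffle=True)

for _ in range(3):
    batch, = next(iter(data_loader))   # new iterator each time -> new shuffle
    print(batch.tolist())              # usually a different first batch on each call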

@HisiFish yes, you are right.
I run the teacher model and the student model in the same loop, and it works.
For example:

for img_batch, label_batch in dataloader:
    y_student = f_student(img_batch)
    with torch.no_grad():              # no gradients needed for the teacher
        y_teacher = f_teacher(img_batch)

refer to: https://github.com/szagoruyko/attention-transfer/blob/master/cifar.py
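
For completeness, a minimal sketch of a full KD training step built around that pattern. The function name, the T/alpha defaults, and the loss weighting are illustrative, following the standard Hinton-style soft-target loss rather than necessarily the exact loss used in this repo; f_student, f_teacher, dataloader, and optimizer are assumed to be defined as above.

import torch
import torch.nn.functional as F

def train_kd_epoch(f_student, f_teacher, dataloader, optimizer, T=4.0, alpha=0.9):
    """One KD epoch where teacher outputs are computed on the same batch."""
    f_teacher.eval()
    f_student.train()
    for img_batch, label_batch in dataloader:
        y_student = f_student(img_batch)
        with torch.no_grad():
            y_teacher = f_teacher(img_batch)   # same batch, so outputs line up

        # Soft-target KL term (scaled by T^2) plus hard-label cross-entropy.
        kd = F.kl_div(F.log_softmax(y_student / T, dim=1),
                      F.softmax(y_teacher / T, dim=1),
                      reduction="batchmean") * (alpha * T * T)
        ce = F.cross_entropy(y_student, label_batch) * (1.0 - alpha)

        optimizer.zero_grad()
        (kd + ce).backward()
        optimizer.step()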

Hi @luhaifeng19947, I haven't followed the discussions here for a while. Are you interested in initiating a pull request?