xuguodong03/SSKD

Hyperparameter discrepancy between paper(s) and code


Hello @xuguodong03 @liuziwei7

Thank you for your interesting study and for publishing your code for the CIFAR dataset!
I recently found a couple of issues regarding hyperparameters. Could you please answer the following questions?

1. ECCV vs. arXiv papers

In your ECCV supplementary material, you say

For student training, we set λ1 = 0.1, λ2 = 0.9, λ3 = 3.0, λ4 = 5.0 in Eq 8.

but your arXiv paper says

For student training, we set λ1 = 0.1, λ2 =0.9, λ3 = 2.7, λ4 = 10.0 in Eq 8.

Could you tell us which set of hyperparameters was actually used to obtain the best results?

2. Paper vs. Code

Even if the second set of hyperparameters (from the arXiv paper) is used, it looks like the values of λ3 and λ4 are swapped in your code,
i.e., λ3 = 10.0, λ4 = 2.7 (in your code) instead of λ3 = 2.7, λ4 = 10.0 (in the arXiv paper):
https://github.com/xuguodong03/SSKD/blob/master/student.py#L41-L42
https://github.com/xuguodong03/SSKD/blob/master/student.py#L295
So, which one is correct?
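
To make sure I am reading Eq. 8 the same way you do, here is a minimal sketch of how I would combine the four weighted terms; the argument names are placeholders for illustration, not the actual variables in student.py.

    def sskd_total_loss(loss_ce, loss_kd, loss_ss, loss_tf,
                        lambdas=(0.1, 0.9, 2.7, 10.0)):
        """Weighted sum of the four terms in Eq. 8.

        The argument names are placeholders for the four loss terms (not the
        actual variables in student.py). The default lambdas follow the arXiv
        paper; student.py in this repo effectively uses (0.1, 0.9, 10.0, 2.7),
        i.e. lambda3 and lambda4 swapped.
        """
        l1, l2, l3, l4 = lambdas
        return l1 * loss_ce + l2 * loss_kd + l3 * loss_ss + l4 * loss_tf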

3. ImageNet

Lastly, I would like to know the hyperparameters you used to train the teacher's proj_head (e.g., the number of epochs) before training the student, which I couldn't find in your papers, in addition to the hyperparameters used to train the students, as asked in #7.
Could you please provide the information here for a fair comparison?

Thank you!

Thanks for your attention to our work.

The results in the paper come from the hyperparameters in this repo, i.e., λ3 = 10.0, λ4 = 2.7. The numbers in the arXiv and ECCV versions contain some careless typos. We are really sorry for the confusion.

On ImageNet, we use the same temperatures and loss weights as in the CIFAR-100 experiments, except that λ1 is set to 1.0. We train the teacher's proj_head for 30 epochs with an initial learning rate of 0.1, decayed by a factor of 10 at epochs 10 and 20. We train the student model for 100 epochs; the initial learning rate is 0.1 and is decayed by a factor of 10 at epochs 30, 60, and 90. We train models on eight parallel GPUs with a total batch size of 256. The optimizer parameters are the same as in the CIFAR-100 experiments.
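
For reference, the schedules above correspond to something like the following sketch; the SGD momentum and weight decay are not restated here, so the values below are placeholders rather than the exact CIFAR-100 settings.

    import torch

    def make_optimizer_and_scheduler(params, milestones, lr=0.1,
                                     momentum=0.9, weight_decay=1e-4):
        # Placeholder momentum/weight decay; the actual values follow the
        # CIFAR-100 scripts in this repo.
        optimizer = torch.optim.SGD(params, lr=lr, momentum=momentum,
                                    weight_decay=weight_decay)
        # "decayed by a factor of 10" -> multiply the lr by 0.1 at each milestone
        scheduler = torch.optim.lr_scheduler.MultiStepLR(
            optimizer, milestones=milestones, gamma=0.1)
        return optimizer, scheduler

    # Teacher proj_head: 30 epochs, lr decayed at epochs 10 and 20.
    # opt_t, sched_t = make_optimizer_and_scheduler(proj_head.parameters(), [10, 20])

    # Student: 100 epochs, lr decayed at epochs 30, 60 and 90,
    # total batch size 256 across eight GPUs.
    # opt_s, sched_s = make_optimizer_and_scheduler(student.parameters(), [30, 60, 90])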

Thank you @xuguodong03 for your prompt response!

Could you clarify three more things:

We train models on eight parallel GPUs with a total batch size of 256.

  1. Does this mean you trained the teacher and student models wrapped by DistributedDataParallel with
    python -m torch.distributed.launch --nproc_per_node=8 --use_env <YOUR IMAGENET CODE> --world_size 8 ...
    and a per-GPU batch size of 32 (32 x 8 = 256)?
    Or did you use just one process (i.e., without torch.distributed.launch) and train models wrapped by DataParallel?

  2. If you used torch.distributed.launch, did you also apply SyncBatchNorm to the student model (see the sketch after question 3)?
    https://github.com/pytorch/vision/blob/master/references/classification/train.py#L172

  3. For ResNet-18 and -34 in your ImageNet experiments, did you feed the flattened output of avgpool (below) to proj_head, with feat_dim=512?
    https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py#L214
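
To make questions 2 and 3 concrete, here is a minimal sketch of the setup I have in mind; proj_head, extract_pooled_features, and local_rank are placeholder names for illustration, not your actual ImageNet code.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    feat_dim = 512  # flattened avgpool output size for ResNet-18/34 (question 3)
    # Placeholder projection head; the real SSKD proj_head architecture may differ.
    proj_head = nn.Sequential(nn.Linear(feat_dim, feat_dim),
                              nn.ReLU(inplace=True),
                              nn.Linear(feat_dim, feat_dim))

    student = resnet18()

    def extract_pooled_features(model, x):
        # Re-run torchvision's ResNet forward up to avgpool and flatten the result.
        out = model.conv1(x)
        out = model.bn1(out)
        out = model.relu(out)
        out = model.maxpool(out)
        out = model.layer1(out)
        out = model.layer2(out)
        out = model.layer3(out)
        out = model.layer4(out)
        return torch.flatten(model.avgpool(out), 1)  # [N, 512] for ResNet-18/34

    feats = extract_pooled_features(student, torch.randn(2, 3, 224, 224))
    z = proj_head(feats)  # the tensor I assume is fed to proj_head

    # Question 2: converting BatchNorm to SyncBatchNorm before DDP wrapping,
    # as the torchvision reference script does (only valid inside an
    # initialized torch.distributed process group):
    # student = nn.SyncBatchNorm.convert_sync_batchnorm(student)
    # student = nn.parallel.DistributedDataParallel(student, device_ids=[local_rank])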