Hyperparameter discrepancy between paper(s) and code
Hello @xuguodong03 @liuziwei7
Thank you for your interesting study and for publishing your code for the CIFAR dataset!
I recently found a couple of issues regarding hyperparameters. Could you please answer the following questions?
1. ECCV vs. arXiv papers
In your ECCV supplementary material, you say
For student training, we set λ1 = 0.1, λ2 = 0.9, λ3 = 3.0, λ4 = 5.0 in Eq 8.
but your arXiv paper says
For student training, we set λ1 = 0.1, λ2 =0.9, λ3 = 2.7, λ4 = 10.0 in Eq 8.
Could you tell us which set of hyperparameters was actually used to obtain the best results?
2. Paper vs. Code
Even if the second set of hyperparameters (from the arXiv paper) is used, it looks like the values of λ3 and λ4 are swapped in your code,
i.e., λ3 = 10.0, λ4 = 2.7 (in your code) instead of λ3 = 2.7, λ4 = 10.0 (in the arXiv paper).
https://github.com/xuguodong03/SSKD/blob/master/student.py#L41-L42
https://github.com/xuguodong03/SSKD/blob/master/student.py#L295
So, which one is correct?
3. ImageNet
Lastly, I would like to know the hyperparameters you used to train the teacher's proj_head (e.g., number of epochs) before training the student, which I couldn't find in your papers, in addition to the student-training hyperparameters asked about in #7.
Could you please provide the information here for a fair comparison?
Thank you!
Thanks for your attention to our work.
The results in the paper come from the hyperparameters in this repo, i.e., λ3 = 10.0 and λ4 = 2.7. The numbers in the arXiv and ECCV versions contain careless typos. We are really sorry for the confusion.
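To make the confirmed weighting concrete, here is a minimal sketch of a weighted four-term loss as in Eq. 8, using λ1 = 0.1, λ2 = 0.9, λ3 = 10.0, λ4 = 2.7; the term names below are placeholders, not the actual variable names in student.py:

```python
# Weights confirmed above (repo values): λ1, λ2, λ3, λ4 in Eq. 8.
LAMBDA1, LAMBDA2, LAMBDA3, LAMBDA4 = 0.1, 0.9, 10.0, 2.7

def combined_loss(l_a, l_b, l_c, l_d):
    """Weighted sum of the four loss components of Eq. 8.

    The arguments are placeholders for the four terms defined in the
    paper; this only illustrates how the weights combine them.
    """
    return LAMBDA1 * l_a + LAMBDA2 * l_b + LAMBDA3 * l_c + LAMBDA4 * l_d
```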
In ImageNet, we use the same temperatures and loss weights as in the CIFAR-100 experiments, except that λ1 is set to 1.0. We train the teacher's proj_head for 30 epochs with an initial lr of 0.1, decayed by 10 at epochs 10 and 20. We train the student model for 100 epochs; the initial learning rate is 0.1 and is decayed by 10 at epochs 30, 60, and 90. We train models on eight parallel GPUs with a total batch size of 256. The optimizer parameters are the same as in the CIFAR-100 experiments.
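The student schedule above (initial lr 0.1, divided by 10 at epochs 30, 60, and 90) can be sketched as a plain step-decay function; the function name and argument defaults are ours, not from the repo:

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Learning rate at a given epoch under the step-decay schedule
    described above: multiply by gamma at each milestone epoch."""
    n_decays = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** n_decays
```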
Thank you @xuguodong03 for your prompt response!
Could you clarify three more things:
> We train models with eight parallel GPUs with a total batch size of 256.

1. Does this mean you trained the teacher and student models wrapped by `DistributedDataParallel` with
   `python -m torch.distributed.launch --nproc_per_node=8 --use_env <YOUR IMAGENET CODE> --world_size 8 ...`
   and a per-GPU batch size of 32 (32 x 8 = 256)? Or did you use just one process (i.e., without `torch.distributed.launch`) and train models wrapped by `DataParallel`?
2. If you used `torch.distributed.launch`, did you also apply `SyncBatchNorm` to the student model?
   https://github.com/pytorch/vision/blob/master/references/classification/train.py#L172
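For context, a hedged sketch of the `SyncBatchNorm` conversion that the linked reference script applies before DDP wrapping (assuming standard PyTorch usage; the model below is a placeholder, not the SSKD student):

```python
import torch.nn as nn

# Tiny stand-in model containing a BatchNorm layer (placeholder network).
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# convert_sync_batchnorm replaces every BatchNorm*d module with SyncBatchNorm;
# the cross-GPU statistics sync only takes effect once the model runs under an
# initialized process group (i.e., after torch.distributed.launch + DDP wrapping).
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```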
3. For ResNet-18 and -34 in your ImageNet experiments, did you feed the flattened output of `avgpool` (below) to `proj_head`, with `feat_dim=512`?
   https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py#L214