choice of training epochs
hlml opened this issue · 10 comments
Hi Ahmed,
I was wondering about the choice of 200 training epochs for the baselines. Have you tried training for longer, e.g., matching the training time of the KE approaches? If I train for 2200 epochs without KE on Flower and CUB, I get 66% and 70%, respectively. This is better than the KE results after 10 generations. I'm curious to get your thoughts on this. Thanks!!
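For reference, here is a rough sketch of how I am counting the training budget (the per-generation epoch count is my assumption, reused from the 200-epoch baseline setting):

```python
# Rough epoch-budget sketch. Assumption: each KE generation is trained for the
# same 200 epochs used for the baseline.
epochs_per_generation = 200
num_generations = 10

ke_budget = num_generations * epochs_per_generation                  # 2000 epochs over 10 generations
ke_budget_with_init = (num_generations + 1) * epochs_per_generation  # 2200 if an initial model is trained first

long_run_budget = 2200  # the single long run without KE
```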
Hi Hattie,
The 200-epoch setting is the same one used in the CS_KD paper [1]. I am surprised that you got these good results! I expected the loss to plateau, especially on a small dataset like FLW. I wonder whether you can achieve similarly good results on a split-ResNet -- with a smaller inference cost.
I assume you are using the same learning rate schedule, optimizer, and vanilla cross-entropy.
[1] Regularizing Class-wise Predictions via Self-knowledge Distillation
One more thing: can you please repeat the same experiment on a bigger network, e.g., DenseNet169? On a bigger network, the chance of overfitting is higher.
Thanks for the info! Yes, I am using the same scheduler, optimizer, and loss (label smoothing or CS-KD). Do you mean training for 2200 epochs on the split-ResNet? I guess if the split-ResNet is trained using KE first, then the baseline model would have trained for even longer in that case.
The same thing happens with DenseNet169: the baseline, using the same hyperparameters, seems to outperform KE.
I used a cosine scheduler. In your experiment, does the scheduler reset multiple times -- after every 200 epochs -- or just once over the full 2200 epochs?
The cosine scheduler can have multiple slopes depending on the T_max parameter.
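To make the distinction concrete, here is a minimal sketch of the two settings (assuming PyTorch's CosineAnnealingLR; the model, optimizer, and learning-rate values are just placeholders):

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(10, 2)  # placeholder model; values are illustrative

# (a) KE-style schedule: a fresh cosine slope per generation. The scheduler is
#     re-created every 200 epochs, so the learning rate climbs back up at the
#     start of each generation.
opt_a = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
sched_a = CosineAnnealingLR(opt_a, T_max=200)

# (b) Single long run: one cosine slope over all 2200 epochs, so the learning
#     rate decays once and never restarts.
opt_b = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
sched_b = CosineAnnealingLR(opt_b, T_max=2200)

# In both cases, scheduler.step() is called once per epoch; only T_max (and
# whether the scheduler is re-created between generations) differs.
```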
Right, I believe the scheduler was only reset once in my experiments, so the actual learning-rate schedule would be different from KE, but the number of iterations would remain the same.
Thanks, Hattie, for running these interesting experiments. I am sorry I have no bandwidth to run similar experiments and confirm your findings.
That being said, your experiments are interesting because they open new questions. For instance, in Table 9, KE significantly closes the gap between randomly initialized and pre-trained networks; with your recent experiments, it seems like you can close this gap even further without bells and whistles -- just by increasing the number of epochs! If so, do we really need to pre-train on ImageNet? I am still digesting the implications of your experiments; I am not sure how to interpret them in light of overfitting.
Regarding my split-ResNet question: you achieved 66% on FLW using a ResNet -- with all its weights. Yet, in Table 5, KE achieves a significant performance improvement while reducing inference cost -- a relative 73% reduction. Accordingly, I am wondering how much improvement you can achieve at a similar computational cost. Does that make sense?
Thanks
Hi Ahmed! Your question makes sense. I haven't run those specific experiments, but I think some of my other analyses do shed some light on related questions. I've included them as part of a new paper that I'd be happy to share once it's public (or email me for the preprint if you're interested)!
I am hoping to release the code that I've used for that project, and the KE-related experiments build on top of your code. Since there isn't a license in this repo, would you be opposed to me releasing code that builds on top of your repo? Alternatively, you could add a license to the repo (similar to https://github.com/allenai/hidden-networks/blob/master/LICENSE) if you'd like!
Hi Hattie,
I added a license file. Feel free to build on my repos.
I would be interested to take a look at your preprint.
Thanks
This seems related to KE and your experiments (ResNet vs. DenseNet). Yet, this is news to me! What do you think?
Thank you! Do you have an email address I can use for more discussions?