exact settings to reproduce 5.82% robust error for MNIST
gwding opened this issue · 5 comments
I'm trying to reproduce the MNIST results.
I modified the code so that the "warmup" (epsilon from 0.05 to 0.1) takes 10 epochs,
and then I ran
python mnist.py --epochs 100
and the test robust error I got is 6.57%.
So I wonder: am I exactly replicating the settings in the paper?
Should I try a different random seed?
Should I set --scatter_grad and --alpha_grad, which are False by default?
(Also, it seems that l1_proj appears in a few places, but actually isn't in dual.py.)
Never mind: it seems that after setting --scatter_grad and --alpha_grad, it reaches the reported performance after 21 epochs.
But I'm still curious about the motivation for having an option to not use those gradients. Was that a better choice under certain circumstances?
Hi Gavin! Thanks for opening the issue, as it helped me find a typo in the paper: the schedule actually takes uniform steps over the first half of all the epochs, which is how it's implemented in the repository. So instead of taking steps over the first 10 epochs, it should take steps over the first 50 epochs if you're training for 100 epochs. Using this schedule will reproduce the result. In general, you'll achieve the best performance with more gradual steps.
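For concreteness, here is a minimal sketch of the warmup described above: epsilon stepping uniformly from its initial to its final value over the first half of training, then staying fixed. The function name and signature are hypothetical (not from the repository); the 0.05 and 0.1 values are the ones mentioned in this thread.

```python
def epsilon_schedule(epoch, total_epochs, eps_start=0.05, eps_end=0.1):
    """Linearly interpolate epsilon over the first half of training.

    Hypothetical helper illustrating the schedule discussed above;
    the actual implementation in the repository may differ.
    """
    warmup = total_epochs // 2  # uniform steps over the first half of all epochs
    if epoch >= warmup:
        return eps_end
    # Linear interpolation so epsilon hits eps_end on the last warmup epoch.
    return eps_start + (eps_end - eps_start) * epoch / max(warmup - 1, 1)

# With 100 total epochs, epsilon steps from 0.05 up to 0.1 over epochs 0-49,
# then stays at 0.1 for the remaining epochs.
for epoch in (0, 25, 49, 99):
    print(epoch, epsilon_schedule(epoch, 100))
```

So with --epochs 100 the warmup should span 50 epochs, not 10, which is why compressing it to 10 epochs degraded the robust error.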
However, it is quite interesting to know that the same value can be achieved after 21 epochs when setting those flags! Originally, not using those gradients was actually a mistake in the code, but when I corrected it, I found that using the gradients worsened performance in my preliminary experiments, so I left it as a flag that defaults to not using them.
However, that was before I had any sort of epsilon scheduling (epsilon was fixed at its initial value), so it appears that with a proper epsilon schedule, ignoring the gradients actually hurts! I'll test this out, and if it holds up, I'll probably change the default value to True so that all the gradient information is used.
Hi Eric, Thanks for the clarification!
I've run a few more experiments on this, and so far it looks like you're right: with an epsilon schedule in place, it is no longer beneficial to ignore the gradients, and we get even better results when we use them. I'll include this change in a major update to the repository at the end of this month, but for now it suffices to supply the gradient flags. Thanks for finding this out!