Computation time
sungyoon-lee opened this issue · 7 comments
Hi,
I've run mnist.py on a single Titan X (Pascal) with the default settings.
However, the speed is about 3x slower than that reported in Table 1 of the paper:
Scaling provable adversarial defenses
My attempt: 0.19 s/batch x 1200 batches ≈ 230 s/epoch, vs. the reported 74 s/epoch.
I think the only differences are that I'm using PyTorch 1.4.0 and that I've changed dual_layers.py to use 'reshape' instead of 'view' (sketched below).
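For context, this is the kind of change I mean (a minimal sketch, not the actual dual_layers.py lines):

```python
import torch

# Illustrative only, not the actual dual_layers.py code.
# .view() requires a contiguous tensor, while .reshape() falls back
# to a copy when the input is non-contiguous:
x = torch.randn(8, 4, 7, 7).permute(0, 2, 3, 1)  # permute -> non-contiguous
# y = x.view(8, -1)   # RuntimeError in recent PyTorch versions
y = x.reshape(8, -1)  # same result; copies if necessary
```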
Hi Sungyoon,
I don't currently have access to a Titan X to verify this exactly, but are you running the script with exact bound computation? The numbers in the paper reflect the use of random Cauchy projections described in section 3.2 (I believe with 50 random projections). Running the exact bound computation will of course be slower.
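Roughly, the projection trick estimates the row-wise l1 norms that show up in the bound as the median of the absolute values of a few random Cauchy projections. A quick sketch (illustrative names, not the repository's actual code):

```python
import torch

def l1_norm_median_estimate(A, k=50):
    # Sketch of the Cauchy-projection estimator from Section 3.2
    # (illustrative, not the repository's implementation). If G has
    # i.i.d. standard Cauchy entries, then each row a_i of A gives
    # a_i @ G ~ Cauchy with scale ||a_i||_1, and since the median of
    # |Cauchy(0,1)| is 1, the per-row median of |A @ G| over the
    # k projections estimates the row-wise l1 norms.
    n = A.shape[1]
    G = torch.zeros(n, k).cauchy_()       # k random Cauchy vectors
    return torch.median((A @ G).abs(), dim=1)[0]

A = torch.randn(5, 784)
print(A.abs().sum(dim=1))                 # exact l1 norms
print(l1_norm_median_estimate(A))         # median-of-projections estimates
```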
~Eric
@riceric22
Thank you for the quick response. I ran the following:
server:~/convex_adversarial$ python examples/mnist.py
I've also tried with the argument --proj 50:
server:~/convex_adversarial$ python examples/mnist.py --proj 50
But this also runs at a similar speed (0.18 s/batch x 1200 batches = 216 s/epoch).
I think it's slow because I'm using a single GPU instead of 4, so I tried:
server:~/convex_adversarial$ python examples/mnist.py --proj 50 --cuda_ids 0,1,2,3
This runs at a speed similar to that reported in the paper (0.08 s/batch x 1200 batches = 96 s/epoch).
Moreover, I can't run cifar.py with the default settings because of an out-of-memory error, so I have to use --cuda_ids 0,1,2,3. Even then, I couldn't run cifar.py for the 'large' network with 4 GPUs, or even with 8.
Hi Sungyoon,
In addition to --proj 50, you also need to specify --norm_train l1_median and --norm_test l1_median to use the median estimator for random projections during training and testing; otherwise it will still compute the exact bound (which is why you see the same speed). I realize this wasn't well documented in the code, thanks for bringing this up. MNIST definitely doesn't need more than one GPU, and also note that for MNIST it's possible to use even fewer random projections (e.g. 10) and still get comparable results, as in the example below.
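For example, putting the flags together (with 10 projections, as suggested above):
server:~/convex_adversarial$ python examples/mnist.py --proj 10 --norm_train l1_median --norm_test l1_median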
Computing exact bounds on CIFAR10 does, however, run out of memory due to the increased input size. In my experience it is not possible to compute the exact bound on more than one example at a time, so make sure you use the random projections during training to get the speeds reported in the paper; the projections are what make this approach scale.
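As a rough back-of-the-envelope: the exact bound effectively propagates one dual direction per input coordinate through the network, i.e. 3x32x32 = 3072 directions per CIFAR10 example (versus 28x28 = 784 for MNIST), whereas --proj 50 caps this at 50 directions regardless of input size, roughly a 60x reduction in that term's memory.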
~Eric
@riceric22
Thank you! The code now runs fast, even faster than reported in the paper (0.03 s/batch x 1200 batches = 36 s/epoch):
server:~/convex_adversarial$ python examples/mnist.py --proj 50 --norm_train l1_median --norm_test l1_median
However, it produces NaN losses (in 3 trials), and I suspect it runs faster because of this error. The same NaN loss error also occurs on CIFAR-10.
It seems that somewhere after PyTorch 1.0, an underlying change in PyTorch introduced NaNs into the projection code: I can run training normally without NaNs in my PyTorch 1.0 environment, but I can reproduce the NaNs in my PyTorch 1.2 environment.
I'll take a look and try to narrow down what happened here, but you should be able to run this normally with PyTorch 1.0.
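If anyone wants to help narrow it down in the meantime, something along these lines can localize where the NaNs first appear (a generic debugging sketch, nothing specific to this repo):

```python
import torch

# Flag the backward op that first produces NaN/inf gradients
# (slows training, so enable only while debugging):
torch.autograd.set_detect_anomaly(True)

# And/or assert on intermediate tensors in the forward pass,
# e.g. right after the projected norm terms are computed:
def assert_finite(name, t):
    if not torch.isfinite(t).all():
        raise RuntimeError("non-finite values in " + name)
```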
Thank you very much, Eric. I tried it in a PyTorch 1.0.0 environment, and it works with no error!
Did anyone manage to reproduce the CIFAR experiments in a more recent PyTorch environment (>=1.4.0) without getting NaNs with projections?