jcjohnson/cnn-benchmarks

vgg16 benchmark?


So I was trying to look at the speed I get on some nets with TF and PyTorch (Maxwell Titan GPU).

I was only timing the forward pass, and I get similar results (TF usually being slightly slower) for the resnet-* architectures. But if I try VGG16, I get something way worse.

| Model | your benchmark | pytorch (me) | tf (me) | Reported MatConvnet from here (interpolated, probably Pascal) |
|---|---|---|---|---|
| resnet-50 | 55.75 | 48.3 | 57.6 | ~40 |
| vgg16 | 62.30 | 113 | 169 | ~80 |

I am a bit surprised that all the others show a sharp increase from resnet-50 to vgg16 (more than twice as slow on average), but your benchmark does not.
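To make the comparison concrete, here is a minimal sketch that computes the vgg16 / resnet-50 slowdown for each column of the table above. The numbers are copied from the table; I am assuming they are all in milliseconds, and the names `results` and `slowdown` are just made up for illustration.

```python
# Forward-pass numbers copied from the table above (assumed to be milliseconds).
# The MatConvnet entries are the approximate "~40" and "~80" values.
results = {
    "your benchmark": {"resnet-50": 55.75, "vgg16": 62.30},
    "pytorch (me)": {"resnet-50": 48.3, "vgg16": 113.0},
    "tf (me)": {"resnet-50": 57.6, "vgg16": 169.0},
    "MatConvnet (reported)": {"resnet-50": 40.0, "vgg16": 80.0},
}


def slowdown(entry):
    """How many times slower vgg16 is than resnet-50 for one column."""
    return entry["vgg16"] / entry["resnet-50"]


ratios = {name: slowdown(entry) for name, entry in results.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Only the first column stays close to 1x; the other three are at 2x or more, which is the gap the question is about.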

My guess is that you don't have the cuDNN autotuner enabled in other frameworks. I'm not too familiar with TensorFlow or MatConvnet, but here is a quick PyTorch benchmarking script for VGG16:

import time
import torch
import torchvision

# Let cuDNN try its convolution algorithms and keep the fastest one
torch.backends.cudnn.benchmark = True

dtype = torch.cuda.FloatTensor
N, C, H, W = 16, 3, 224, 224  # batch size, channels, height, width

model = torchvision.models.vgg16()
print(model)
model.type(dtype)  # move the model to the GPU as float32

times = []
for t in range(10):
  x = torch.randn(N, C, H, W).type(dtype)
  # CUDA calls are asynchronous, so synchronize before reading the clock
  torch.cuda.synchronize()
  t0 = time.time()
  y = model(torch.autograd.Variable(x))
  torch.cuda.synchronize()
  t1 = time.time()
  times.append(t1 - t0)

print(times)

When I run this on my Maxwell Titan X I get:

1.2714393138885498
0.0651240348815918
0.06461572647094727
0.0646982192993164
0.0645449161529541
0.06469154357910156
0.06457901000976562
0.06456184387207031
0.06459999084472656
0.06464290618896484

This matches my Lua Torch benchmark. However, if I disable the cuDNN autotuner by deleting the line

torch.backends.cudnn.benchmark = True

then I get times that match your results:

0.4282658100128174
0.11162209510803223
0.11102771759033203
0.11126351356506348
0.11090779304504395
0.11112117767333984
0.11123847961425781
0.11114788055419922
0.11153745651245117
0.11166501045227051
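As a rough summary of the two runs above, here is a small sketch that averages the steady-state iterations and compares them. It drops the first iteration of each run, since that one presumably includes one-time startup cost (and, with the autotuner on, the algorithm search). The helper name `steady_state_ms` and its `n_warmup` parameter are my own for illustration, not part of any library.

```python
from statistics import mean

# Per-iteration forward-pass times in seconds, copied from the two runs above.
autotuner_on = [
    1.2714393138885498, 0.0651240348815918, 0.06461572647094727,
    0.0646982192993164, 0.0645449161529541, 0.06469154357910156,
    0.06457901000976562, 0.06456184387207031, 0.06459999084472656,
    0.06464290618896484,
]
autotuner_off = [
    0.4282658100128174, 0.11162209510803223, 0.11102771759033203,
    0.11126351356506348, 0.11090779304504395, 0.11112117767333984,
    0.11123847961425781, 0.11114788055419922, 0.11153745651245117,
    0.11166501045227051,
]


def steady_state_ms(times, n_warmup=1):
    """Mean time in milliseconds, skipping the first n_warmup iterations."""
    return 1000.0 * mean(times[n_warmup:])


on_ms = steady_state_ms(autotuner_on)
off_ms = steady_state_ms(autotuner_off)
print(f"autotuner on:  {on_ms:.1f} ms")
print(f"autotuner off: {off_ms:.1f} ms")
print(f"ratio off/on:  {off_ms / on_ms:.2f}x")
```

Skipping the warm-up iteration matters here: including the 1.27 s first call would make the autotuned run look slower on average than the run without it.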

Aaaahhh! Thanks a lot, that was indeed the culprit. I get exactly your results with my script as well.

Though that does not explain the TensorFlow slowness, because auto-tuning is active by default... But that is unrelated to this repo. Are the Torch kernels really that much better optimized for 3x3 convolutions? I would have expected very similar performance, since cuDNN is doing most of the work.

Anyway, thanks for the answer :-)