A very simple and naive benchmark comparing dlib and PyTorch in terms of space (VRAM usage) and time.
These benchmarks were run on an NVIDIA GeForce GTX 1080 Ti with CUDA 10.2.89 and cuDNN 7.6.5.32 on Arch Linux.
The first benchmark measures the network instantiation time. It is probably useless, but it's provided for completeness nonetheless.
In PyTorch:

```python
import torch
from torchvision.models import resnet50

model = resnet50(pretrained=False)
```
In dlib:

```cpp
// the ResNet-50 definition (resnet<BN>::_50) comes from a separate header, not shown here
resnet<dlib::affine>::_50 net;
```
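Neither snippet shows how the instantiation time itself was taken. As a rough sketch (not necessarily the exact harness used for the table below), the dlib construction can be timed with std::chrono; the include of the ResNet definition header is an assumption:

```cpp
#include <chrono>
#include <iostream>

#include <dlib/dnn.h>
#include "resnet.h"  // assumption: the header that defines resnet<BN>::_50

int main()
{
    const auto t0 = std::chrono::steady_clock::now();
    resnet<dlib::affine>::_50 net;  // network instantiation being measured
    const auto t1 = std::chrono::steady_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::cout << "instantiation: " << ms << " ms\n";
}
```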
The first inference is also not very meaningful, since most of the time is spent allocating memory on the GPU.
In PyTorch:

```python
x = torch.zeros(512, 3, 224, 224)
x = x.cuda()
model = model.cuda()
# time measurement start
out = model(x)
# time measurement end
```
In dlib:

```cpp
// create a black 224x224 image and replicate it into a minibatch of 512
dlib::matrix<dlib::rgb_pixel> image(224, 224);
dlib::assign_all_pixels(image, dlib::rgb_pixel(0, 0, 0));
std::vector<dlib::matrix<dlib::rgb_pixel>> minibatch(512, image);
```
At this point, we could just call:
```cpp
const auto out = net(minibatch, 512);
```
But that wouldn't be a fair comparison, since it would do some extra work:
- apply softmax to the output of the net
- transfer the result from the device to the host
As a result, we need to forward a tensor that is already on the device. There are several ways of doing this; here's one:
```cpp
dlib::resizable_tensor x;
net.to_tensor(minibatch.begin(), minibatch.end(), x);
x.device();  // make sure the tensor data has been copied to the device
// time measurement start
net.subnet().forward(x);
// time measurement end
```
Now dlib is doing exactly the same operations as PyTorch, as far as I know.
In my opinion, the most important benchmark is this last one: it measures how the network performs once it has been warmed up.
For this part, I decided not to count the CUDA synchronization time, only the inference time for a tensor that is already on the device.
In PyTorch, every time I forward the network, I first make sure all transfers between the host and the device have finished:
```python
for i in range(10):
    x = x.cpu().cuda()  # round trip to ensure all pending transfers have finished
    # time measurement start
    out = model(x)
    # time measurement end
```
The times measured for each inference are around 6 ms, regardless of the batch size (which is a good indicator that no memory transfers are being timed).
For dlib I followed a similar pattern:
```cpp
for (int i = 0; i < 10; ++i)
{
    x.host();    // copy the tensor back to the host...
    x.device();  // ...and to the device again, mimicking the PyTorch round trip
    // time measurement start
    net.subnet().forward(x);
    // time measurement end
}
```
Here, the time measured for the first inference varies with the batch size (for a batch size of 128 it is around 90 ms). However, the rest of the forward calls take around 0.9 ms, independently of the batch size.
Since the timing variability of the first call is systematic, we can just ignore it: once the network works in a steady state, the forward pass time is constant.
Nevertheless, if somebody has any idea of why this is happening, I would really love to know more.
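For reference, here is a minimal sketch of how the per-call times can be collected and averaged while discarding that first warm-up call; the std::chrono-based harness and the "resnet.h" include are my assumptions, not necessarily what was used to produce the table:

```cpp
#include <chrono>
#include <iostream>
#include <vector>

#include <dlib/dnn.h>
#include <dlib/image_transforms.h>
#include "resnet.h"  // assumption: the header that defines resnet<BN>::_50

int main()
{
    resnet<dlib::affine>::_50 net;

    // same kind of zero-filled minibatch as above, here with batch size 128
    dlib::matrix<dlib::rgb_pixel> image(224, 224);
    dlib::assign_all_pixels(image, dlib::rgb_pixel(0, 0, 0));
    const std::vector<dlib::matrix<dlib::rgb_pixel>> minibatch(128, image);

    dlib::resizable_tensor x;
    net.to_tensor(minibatch.begin(), minibatch.end(), x);
    x.device();

    std::vector<double> times_ms;
    for (int i = 0; i < 10; ++i)
    {
        x.host();
        x.device();
        const auto t0 = std::chrono::steady_clock::now();
        net.subnet().forward(x);
        const auto t1 = std::chrono::steady_clock::now();
        times_ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    // average over all calls except the first (warm-up) one
    double sum = 0;
    for (size_t i = 1; i < times_ms.size(); ++i)
        sum += times_ms[i];
    std::cout << "average forward time: " << sum / (times_ms.size() - 1) << " ms\n";
}
```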
The following table shows the average timings in ms for a tensor of shape 128x3x224x224.
Test | PyTorch (ms) | dlib (ms) | Factor (slower / faster) |
---|---|---|---|
instantiation | 239.672 | 0.078 | 3072.718 |
1st inference | 1160.368 | 2609.590 | 2.250 |
next inference | 6.164 | 0.905 | 6.811 |
I've also measured the VRAM usage in MiB for different batch sizes:
batch size | PyTorch (MiB) | dlib (MiB) | Factor (dlib / PyTorch) |
---|---|---|---|
0 | 473 | 600 | 1.27 |
1 | 711 | 632 | 0.89 |
2 | 709 | 706 | 0.99 |
4 | 729 | 840 | 1.15 |
8 | 765 | 1092 | 1.43 |
16 | 881 | 1556 | 1.77 |
32 | 1211 | 2604 | 2.15 |
64 | 1689 | 4536 | 2.68 |
128 | 2303 | 8374 | 3.64 |
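For completeness, one rough way to sample the VRAM usage programmatically is the CUDA runtime's cudaMemGetInfo (nvidia-smi gives similar device-wide numbers); this is just one possibility, not necessarily how the table above was produced:

```cpp
#include <cstdio>

#include <cuda_runtime.h>

// Print the device-wide VRAM usage in MiB. This includes every process using the
// GPU, so call it from inside the benchmark, after the model and the input tensor
// have been moved to the device, on an otherwise idle GPU.
void print_vram_usage()
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    const double used_mib = static_cast<double>(total_bytes - free_bytes) / (1024.0 * 1024.0);
    std::printf("VRAM used: %.0f MiB\n", used_mib);
}

int main()
{
    print_vram_usage();
}
```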
From this simple benchmark I can only draw the obvious conclusion: dlib is faster but uses more VRAM than PyTorch.