jcjohnson/cnn-benchmarks

CPU performance update

mingfeima opened this issue · 9 comments

@jcjohnson hi, really nice benchmark!

I am working on Torch optimization for Intel platforms, Xeon and Xeon Phi. Our optimized version is much faster than the original Torch CPU backend, and we are trying to upstream it: https://github.com/intel/torch.

Regarding the statement "The Pascal Titan X with cuDNN is 49x to 74x faster than dual Xeon E5-2630 v3 CPUs" — this is somewhat misleading :(

This is because a) Pascal is the latest generation of GPU, while the Xeon E5 v3 is about four years old, and b) our intel-torch yields much faster performance that is competitive with GPUs.

We are happy to update the benchmark performance on Intel's latest hardware platforms — Xeon E5-2699 v4, Xeon Phi 7250 (KNL) — and also on the upcoming platforms (SKL/KNM).

Could you please update these numbers once we finish? :)

@mingfeima I totally agree that the CPU benchmark is pretty far from optimal.

I don't have a lot of experience with squeezing every last bit of performance on CPU, so these numbers were just using a vanilla Torch installation on the cluster machines that I had available. I'm not at all surprised that an expert like yourself could get much better numbers on CPU / Xeon Phi.

I'm happy to update the benchmarks with your results once you are ready.

Hi @jcjohnson ! This repo is pretty awesome! Congrats!
Quick question though: when talking about the CPU benchmarks, you mentioned "cluster machines" — were these benchmarks run in a distributed manner, or are they all single-machine runs?
Thanks!

@renato2099 They are single-machine runs; however, that single machine is one of the machines in our research cluster.

Thanks for the prompt reply @jcjohnson !
Could you please provide a description of such machine(s)?

From the README:

All benchmarks were run in Torch. The GTX 1080 and Maxwell Titan X benchmarks were run on a machine with dual Intel Xeon E5-2630 v3 processors (8 cores each plus hyperthreading means 32 threads) and 64GB RAM running Ubuntu 14.04 with the CUDA 8.0 Release Candidate. The Pascal Titan X benchmarks were run on a machine with an Intel Core i5-6500 CPU and 16GB RAM running Ubuntu 16.04 with the CUDA 8.0 Release Candidate. The GTX 1080 Ti benchmarks were run on a machine with an Intel Core i7-7700 CPU and 64GB RAM running Ubuntu 16.04 with the CUDA 8.0 release.

sorry @jcjohnson, I meant the CPU benchmarks that were reported using dual Xeon E5-2630 v3 — were they run with 16GB or 64GB of RAM? Thanks!

The CPU machine was the same as the 1080 and Maxwell Titan X machines, so it had dual E5-2630 v3 and 64 GB RAM.

Also you really shouldn't take the CPU benchmarks here too seriously - I didn't spend any time tuning BLAS, and it's very possible that more carefully tuning the BLAS installation could have improved results. Also the CPU benchmarks use Torch's built-in convolution routine, which is not heavily optimized for CPU - it's very likely that using something like NNPACK would again lead to significantly improved CPU performance.
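To make the BLAS point above concrete, here is a minimal sketch of how one might sanity-check a BLAS installation — in Python/NumPy rather than Torch, purely for illustration, since both dispatch to the same kind of sgemm call that underlies im2col-style convolution. The `OMP_NUM_THREADS` knob and the matrix size are illustrative assumptions, not the actual tuning discussed in this thread:

```python
import os
# Assumption for illustration: thread count must be set before NumPy
# loads its BLAS backend, so we set it prior to the import.
os.environ.setdefault("OMP_NUM_THREADS", "4")

import time
import numpy as np

def time_matmul(n=512, repeats=5):
    """Return the best wall-clock time (seconds) over `repeats` runs of an
    n x n float32 matrix multiply -- the sgemm workload that dominates
    im2col-based convolution layers."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up so one-time setup costs don't skew the measurement
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    n = 512
    t = time_matmul(n)
    gflops = 2 * n ** 3 / t / 1e9  # 2*n^3 flops per n x n matmul
    print(f"best of 5: {t * 1e3:.2f} ms ({gflops:.1f} GFLOP/s)")
```

Running this under different thread counts and BLAS backends (reference BLAS vs. OpenBLAS vs. MKL) typically shows the multi-x gaps the thread is describing, which is why an untuned CPU baseline understates what the hardware can do.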

thanks @jcjohnson! Yeah, I see what you mean; it's just to get a basic idea of the difference between CPUs and GPUs for a specific model/dataset. So the E5-2630 v3 has 8 cores — how many processors does this machine have? 4?

@jcjohnson @renato2099 we have optimized Torch performance on the CPU platform.
CPU optimization is provided by mklnn and mkltorch; to install Torch with mklnn and mkltorch, refer to
https://github.com/intel/torch

Currently we have a small problem running cnn-benchmarks: some modules used in ResNet are not included in mklnn. We will fix this as soon as possible.

After this, we can provide cnn-benchmarks performance on the latest Intel platforms, e.g. Xeon Skylake 8180 and Knights Mill.

Also, please be aware that our optimization focuses ONLY on server CPUs, i.e. Xeon and Xeon Phi. Desktop CPUs such as the Core i5 and i7 are not on the list.