tensorflow/tensorflow

Feature request: Please provide AVX2/FMA capable builds

Opened this issue · 38 comments

I would go out on a limb and guess that the vast majority of TensorFlow users, on Linux at least, use fairly modern CPUs. It would therefore be beneficial for them to have the prebuilt TF binaries support AVX2/FMA. These two ISA extensions, and especially FMA, tend to speed up GEMM-like math pretty significantly.

It'd be great if the TF team provided a prebuilt Linux release *.whl that supports AVX2/FMA, perhaps as an alternative, non-default wheel. These should be compatible with Haswell and above; Haswell came out in 2013, so lots of people have it by now.

To be clear, this is not a hugely pressing issue, since the *.whl can easily be rebuilt from source. It'd just make things faster and easier for people with modern CPUs.

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

N/A

Environment info

Operating System:
Linux Ubuntu 16.04

Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*): NONE

If installed from binary pip package, provide:

  1. A link to the pip package you installed: https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.0.0rc0-cp35-cp35m-linux_x86_64.whl
  2. The output from python -c "import tensorflow; print(tensorflow.__version__)": 1.0.0-rc0

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

Code:

import tensorflow as tf
sess = tf.InteractiveSession()

Output:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

What other attempted solutions have you tried?

Compiled from source.

Logs or other output that would be helpful

(If logs are large, please upload as attachment or provide link).

As a performance datapoint, this matrix multiplication benchmark goes from 0.31 Tops/sec to 1.05 Tops/sec when enabling AVX2/FMA on our Intel Xeon 3 @ 2.4 GHz servers: https://github.com/yaroslavvb/stuff/blob/master/matmul_bench.py

On the other hand, there may be technical issues with infrastructure that make it hard to set up such a build. @caisq for comment

caisq commented

+@gunan

I believe our tooling and CI machines have the capacity to run bazel build with --copt=-mavx2 and --copt=-mfma. Gunhan, what do you think of expanding the nightly and release matrices to support those build options? My sense is that we are already a little constrained in terms of the machine resources and manpower to fix breakages.
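
For reference, a from-source build with those flags would look roughly like the following. Treat this as a sketch: the exact ./configure prompts and the output path vary by TF version.

# Sketch of a CPU-optimized from-source build; prompts and paths differ between versions.
./configure
bazel build --config=opt --copt=-mavx2 --copt=-mfma \
    //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl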

gunan commented

This was discussed before, and the decision was to make the released binaries work for everyone.
While it is very easy for users to upgrade personal computers, cloud providers and fleets of thousands of machines take longer to upgrade. With 0.12, we tried to enable AVX and SSE 4.1, but we had to roll it back because SSE 4.1 is not as common on AMD CPUs as it is on Intel CPUs.

So, we decided on the following policy for SIMD instruction sets going forward:
1 - Our released binaries will be as portable as possible, working out of the box on most machines.
2 - For people who need the best TF performance, we recommend building from sources. We will make sure things build well and that building from sources is as easy as possible, but rather than supporting a new binary package for each of the tens of CPU architectures out in the wild, we decided the best approach is to let users build binaries as needed.

Since Intel is getting involved, perhaps they would be willing to maintain an Intel-optimized build of TensorFlow? cc @mahmoud-abuzaina in case he has some connections

@dmitry-xnor I guess the issue here is limited resources at Google. Releasing an official wheel with a new configuration means you have to support it and fix issues that arise. I have seen some subtle alignment issues caused by enabling avx2, so troubleshooting such things can take time. And if you don't fix them, people get mad at Google, since the release is "official". Also, this sets a precedent for supporting a "highly optimized" binary + "lowest common denominator" binary, and the standard of "highly optimized binary" can shift over time. I do agree that it would be nice to have an Intel-specific build that's highly optimized.

I'm currently getting around this by launching "build --config=opt --config=cuda" builds weekly and dropping the resulting wheel in a shared folder for other users in our company.

@dmitry-xnor Now that I think of it, I could probably drop such binaries into a public shared folder as well, since I'm going through my build process anyway. I'm building with --config=opt --config=cuda with CUDA 8.0 on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz from Cirrascale, which seems like a common configuration. The downside is that I don't have time to set up cloud storage or research uploading, but if someone gave me an easy recipe to follow, I could do that.

Do you build as yourself interactively or is this an automated build? If you build as yourself, the solution is easily scriptable. Create a GCS bucket for binaries once, then just upload stuff there for every release and make it public like so, using gsutil:

gsutil cp $TENSORFLOW_WHL gs://<bucket name>/
gsutil acl set public-read gs://<bucket name>/$TENSORFLOW_WHL

And publish the resulting download URL somewhere. Note also that you should not rename the WHL or else pip3 will barf.
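
If you want to script the whole thing, a minimal sketch might look like this; the bucket name and wheel path are placeholders, adjust them to your setup:

#!/bin/bash
# Sketch: upload a freshly built wheel to a public GCS bucket (placeholders throughout).
set -e
BUCKET=my-tf-wheels   # assumed bucket, created once with: gsutil mb gs://my-tf-wheels
TENSORFLOW_WHL=$(ls /tmp/tensorflow_pkg/tensorflow-*.whl | head -n 1)
gsutil cp "$TENSORFLOW_WHL" gs://$BUCKET/
gsutil acl set public-read gs://$BUCKET/$(basename "$TENSORFLOW_WHL")
echo "Download URL: https://storage.googleapis.com/$BUCKET/$(basename "$TENSORFLOW_WHL")"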

thanks, I'll give it a shot on the next build

Sounds great, thanks! Looking forward to prebuilt WHLs! It just doesn't make sense to use the CPU at 1/3rd the speed. 👍

I have just launched a new website for this purpose: TensorFlow Community Wheels. Fully integrated with GitHub.

gue22 commented

Hey guys,
I need to dig deeper into this thread, but to expedite things, here are some thoughts in advance:

  1. Am I mistaken? As far as I saw / understood, on my machine there is not even optimization for SSE1. How about cutting off CPUs of a certain age / SSE level for the default distribution? (Naturally I'd appreciate a community effort for a finer-grained optimization offering!)

  2. Do you have any insight into how XLA / JIT / AOT, announced last Wednesday (2017-02-15), comes to the rescue?

TIA
G.

Any suggestion for users of the TF Docker images? These images have TF pre-installed.

gunan commented

I am using the GPU devel docker images, but right now I am just "using them" without rebuilding / reinstalling.

It is worth considering how far back in SSE instruction support it is reasonable to go when dealing with a machine that needs to have CUDA compute capability >= 3.0 (for the GPU images or GPU wheels).

SSE was causing problems for people running on AMD CPUs -- #6809

After upgrading to 1.0, I found the OS X prebuilt version lacked SSE, FMA, and AVX support. After searching around for a while, there's no alternative except to build it myself. Well then, I'll build it myself.

gunan commented

@yaroslavvb created this repository to link to community supported wheel files.
https://github.com/yaroslavvb/tensorflow-community-wheels

We encourage our community to create and maintain specialized builds, but we will be creating wheel files that are installable on most platforms. Therefore, I will close this issue.

mvpel commented

For example, the glibc library is designed to work anywhere, and has mechanisms to detect the availability of advanced instruction sets and use the proper functions to take advantage of them when they're available, and fall back if they're not. It's not necessary to support a separate binary for every possible processor capability.

Intel Performance Primitives: https://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-is-there-any-function-to-detect-processor-type

Linux Function Multi-Versioning (FMV):
https://clearlinux.org/features/function-multiversioning-fmv

mvpel commented

Here's an LWN article on the FMV capabilities provided for C++ in GCC 4.8 and up, and for C in GCC 6:

https://lwn.net/Articles/691932/
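
To make the FMV idea concrete, here is a minimal sketch (plain C, not TF code) using GCC's target_clones attribute; it assumes GCC 6+ and glibc ifunc support. The compiler emits one clone of the function per listed target and resolves the best one at load time:

/* Minimal FMV sketch: one clone per target, resolved via an ifunc at load time. */
#include <stdio.h>

__attribute__((target_clones("avx2", "avx", "default")))
void axpy(float *x, const float *y, float a, int n) {
    /* The same loop is compiled once per target; the AVX/AVX2 clones get wider vectors. */
    for (int i = 0; i < n; ++i)
        x[i] += a * y[i];
}

int main(void) {
    float x[8] = {0}, y[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    axpy(x, y, 2.0f, 8);
    printf("x[7] = %f\n", x[7]);
    return 0;
}

Because the resolution happens once per process rather than on every call, the overhead is negligible; this is essentially how MKL-style libraries dispatch internally.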

I built these for OS X, with FMA and friends.
https://github.com/ctmakro/tensorflow_custom_python_wheel_build

@gunan For users needing the best performance ... build from sources ...

We will make sure things build well, and building from sources is as easy as possible

That would be acceptable if there were an easy way of building TensorFlow on Windows. Apparently, there isn't: people try, but the official documentation clearly states that Windows is currently not supported. So it would be very valuable if there were optimized builds available, or if you could follow up on @mvpel's suggestion of detecting the CPU and enabling optimizations dynamically. Meanwhile, I will try to follow the instructions from here

FYI: Following up on my last comment, I built the GPU version of TensorFlow with CPU optimizations (AVX) enabled, and I couldn't see much performance improvement on my side, so I will stick to the prebuilt GPU version that can be installed using pip install tensorflow-gpu==1.1.0

@apacha From my experiments I found that CPU-optimized GPU TF doesn't boost performance significantly, but it can keep the CPU cooler. My processor's temperature often goes up to 80C during training, while the optimized TF usually keeps it below 70C.

mvpel commented

@xinyazhang - that can have performance implications, albeit slight, since CPUs will throttle their frequency if they are pushed into the upper limits of their temperature range for too long.

@apacha - There's not much point to vector instructions in GPU-enabled TF runs, since the work that would be done by those instructions on the CPU is done on the GPU much more quickly, so the fact that there's little performance improvement with AVX on a GPU-based run is to be expected.

The basic idea is that there are far more machines out there with AVX, AVX2, SSE, etc. than there are with GPUs, and they're much cheaper to rent in the cloud (an AWS c4.large with AVX2 is 10 cents per hour, while the smallest GPU instance, p2.xlarge, is 90 cents an hour), so wringing out every last bit of CPU performance potential for non-GPU runs can be of benefit, provided that a TF job on c4.large doesn't take 9 times longer than on p2.xlarge.

@mvpel "There's not much point to vector instructions in a GPU-enabled TF runs"

The CPU is very much a bottleneck with today's faster GPUs on certain models. Typically, for computer vision problems for example, you need to do a bunch of data decoding and augmentation, much of which can't be done on the GPU. This is actually a major problem we had with TF for multi-GPU training. Things were so bad (even with AVX2 and FMA enabled) that we switched to using PyTorch just for data augmentation in our TF pipelines. For what we do, it was an easy 40% throughput gain right off the bat, and the code was quite a bit simpler too.

The point is: GPUs are specialized devices, and while they are powerful, they are not really usable for everything. Things are pretty bad even now for high-throughput tasks, and I imagine they'll get much worse when TPUs and NVIDIA V100 GPUs become available.

For anyone looking for optimized builds, we maintain a bunch of these at TinyMind that you can find at https://github.com/mind/wheels.

There are both CPU and GPU builds for all versions post TF1.1. Enjoy :)

bhack commented

@gunan Why can't the TF team officially maintain some alternative builds like the ones suggested in the previous comment?

@bhack it's a business-level decision (what's the best use of Google engineer time?). Providing custom hardware builds can be done by people outside of Google, but there are many TensorFlow-internal things that can only be done by Googlers.

PS: whole "bake AVX2 into binary" is not that great for open-source ecosystem -- TensorFlow would be better off with dynamic dispatch system like what's used by PyTorch, MKL.

bhack commented

I don't think much effort is required other than hardware resources, because I believe the AVX2 code paths are still tested in the matrix. When code is tested, and therefore built, publishing it is fairly automatic. But never mind, Intel is already maintaining optimized builds with a slight delay relative to the official upstream releases.

I don't know how that Intel build works -- does it/conda automatically figure out which instruction sets your machine has and get the proper version? Or does it just push a Xeon V4-optimized build?

This wasn't an issue with MKL because MKL has dynamic dispatch, but TF has to have advanced instructions statically baked in.

At any rate, feel free to use the wheels I posted above @bhack - we rely on these ourselves so we will keep maintaining them. :)

How can I make the TensorFlow installed on my machine use SSE, AVX, and FMA instructions?

If you use Ubuntu 16.04, check out the link I posted above (https://github.com/mind/wheels) - there you can find the version you want as well as instructions on how to install it.
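
If you're unsure which of those instruction sets your CPU actually supports, on Linux you can check the flags the kernel reports before picking a wheel:

# Print the relevant SIMD flags advertised by this CPU (Linux).
grep -o -w -E 'sse4_1|sse4_2|avx2|avx|fma' /proc/cpuinfo | sort -u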

I have tensorflow wheels for a few different configurations at https://github.com/lakshayg/tensorflow-build

Hi all, I'd pay $75 for someone to help me write a Dockerfile for building tensorflow wheels.

We could put it in a public GitHub repo so folks could use it as a reference. I keep getting stuck when trying to follow the official TensorFlow docs. My last bazel build ... attempt ended with this error:

$ git clone -b "r2.10" --single-branch https://github.com/tensorflow/tensorflow.git

$ USE_BAZEL_VERSION=$(cat .bazelversion) \
    bazel build \
    --copt=-mavx2 --copt=-mfma \
    //tensorflow/tools/pip_package:build_pip_package

ERROR: /tensorflow/tensorflow/compiler/mlir/lite/BUILD:295:11: Compiling tensorflow/compiler/mlir/lite/ir/tfl_ops.cc failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 251 arguments skipped)
gcc: fatal error: Killed signal terminated program cc1plus
compilation terminated.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 9695.639s, Critical Path: 4770.50s
INFO: 4737 processes: 771 internal, 3966 local.
FAILED: Build did NOT complete successfully
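
For context, "gcc: fatal error: Killed signal terminated program cc1plus" usually means the compiler was killed by the kernel's OOM killer rather than failing on the code itself. A common workaround (a sketch, not verified against this exact setup) is to cap bazel's parallelism and memory and retry:

# Retry with fewer parallel compile jobs and a RAM cap; tune the numbers for your machine.
USE_BAZEL_VERSION=$(cat .bazelversion) \
    bazel build \
    --jobs=4 --local_ram_resources=HOST_RAM*.5 \
    --copt=-mavx2 --copt=-mfma \
    //tensorflow/tools/pip_package:build_pip_package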

I appreciate the other projects that have been linked here, but none of them have the scripts used for actually doing the builds.

I specifically need a build with AVX2 and FMA, and could use another with AVX2, FMA, and AVX512F (for running on AWS Fargate). Python 3.8-3.10.

I miss GitHub notifications often, so reach out to me on LinkedIn if you're interested: https://www.linkedin.com/in/eric-riddoch/