
Automatically-Tuned Input-Aware implementations of HPC/DNN primitives


ISAAC

This is the development repository for ISAAC, an input-aware auto-tuning framework and code generator for HPC/DL. This version is only compatible with NVIDIA hardware (it generates PTX source code). For OpenCL/CUDA compatibility, see the Intel fork (https://github.com/intel/isaac) or the deprecated v1.0 branch.

License

ISAAC is distributed under the MIT/X11 license.

Compilation

Compiling and using ISAAC only requires a proprietary NVIDIA driver. No CUDA SDK is needed, except for testing and benchmarking against cuBLAS/cuDNN.

git clone https://github.com/ptillet/isaac.git
cd isaac
mkdir build
cd build
cmake ../
make -j8

Python interface

The TensorFlow wrapper can be installed as follows in an environment where TensorFlow is already present.

cd python
python setup.py build
python setup.py install

You can test the installation by executing:

python ./python/examples/benchmark.py

What the script does is pretty straightforward:

import tensorflow as tf
import isaac as sc
isaac = tf.load_op_library(sc.tensorflow)

This exposes isaac.conv2d and isaac.conv3d, which you can use like tf.nn.conv2d and tf.nn.conv3d.
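
For example, here is a minimal sketch of a drop-in use (TF 1.x session style; the strides/padding keyword arguments and the NHWC/HWIO layouts are assumptions based on the tf.nn.conv2d analogy above, so check the op signature in your build if it differs):

# Minimal sketch (TF 1.x style). Assumes isaac.conv2d takes the same arguments
# as tf.nn.conv2d (NHWC input, HWIO filter, strides, padding).
import numpy as np
import tensorflow as tf
import isaac as sc

isaac = tf.load_op_library(sc.tensorflow)

x = tf.placeholder(tf.float32, shape=[16, 224, 224, 3])      # NHWC input
w = tf.Variable(tf.random_normal([3, 3, 3, 64]))              # 3x3 filter, 3 -> 64 channels

y = isaac.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')  # used like tf.nn.conv2d

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, feed_dict={x: np.random.rand(16, 224, 224, 3).astype(np.float32)})
    print(out.shape)  # (16, 224, 224, 64) with 'SAME' padding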

If you don't want to use TensorFlow, the Python bindings can also be used directly. See the "tune/" folder for an example.

Binary interface

Basic GEMM and CONV benchmarks on the DeepBench suite can be obtained using the isaac-tools binary interface:

./examples/isaac-tools --gemm --bench --suite deepbench --dtype float32
./examples/isaac-tools --conv --bench --suite deepbench --dtype float32

Note that only float32 and float64 are supported at the moment.

If you want, you can also dump the PTX source code generated by ISAAC for some shapes:

./examples/isaac-tools --gemm --dump --format ptx --shape 2048,2048,2048 --layout NT --dtype float32

If you really know what you're doing, you can also capture the tiling parameters found by ISAAC:

./examples/isaac-tools --gemm --dump --format params --shape 2048,2048,2048 --layout NT --dtype float32

You will get the following output:

Tuning parameters: 4, 16, 8, 8, 8, 8, 16, 8, 16, 8, 1, 1, 1

The parameters mean, respectively: (1) shared-memory loads have a width of 4; (2) each block comprises 16x8 threads; (3) each thread computes a tile of 8x8 elements; (4) each loop iteration processes 8 elements along the K axis; (5) threads are rearranged as a 16x8 block for loading A and a 16x8 block for loading B; (6) the reduction is split across 1, 1 and 1 independent batches within each thread, thread-block and grid, and the partial results are accumulated after the inner loop.
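
As a rough sanity check, here is a sketch of the block and grid geometry these numbers imply for the 2048x2048x2048 GEMM above, assuming the conventional tiled-GEMM mapping in which a block's C tile is the thread layout multiplied by the per-thread tile (ISAAC's internal mapping may differ in its details):

# Sketch: geometry implied by the tuning parameters above, under the usual
# tiled-GEMM assumption (block C-tile = thread layout * per-thread tile).
M = N = K = 2048

threads_m, threads_n = 16, 8   # (2) threads per block
tile_m, tile_n = 8, 8          # (3) elements of C computed per thread
unroll_k = 8                   # (4) K elements processed per loop iteration

block_tile_m = threads_m * tile_m                 # 128 rows of C per block
block_tile_n = threads_n * tile_n                 # 64 columns of C per block
grid_m = (M + block_tile_m - 1) // block_tile_m   # 16 blocks along M
grid_n = (N + block_tile_n - 1) // block_tile_n   # 32 blocks along N
k_iterations = (K + unroll_k - 1) // unroll_k     # 256 main-loop iterations

print(block_tile_m, block_tile_n, grid_m * grid_n, k_iterations)  # 128 64 512 256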

Benchmarks

Tesla P100 - SGEMM: [benchmark figure]

Tesla P100 - DGEMM: [benchmark figure]

Tesla P100 - SCONV (vs cuDNN's IMPLICIT_PRECOMP_GEMM): [benchmark figure]

Coverage

I would consider both GEMM and CONV production-ready. Kernel selection is performed for each new shape, and the best kernel is cached in RAM. I wouldn't advise this library for applications that use thousands of different shapes exactly once (e.g., blocked SVD).
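
To illustrate why such workloads are a poor fit, here is a toy sketch of shape-keyed kernel caching (hypothetical names, not ISAAC's actual implementation, which lives in C++): every previously unseen shape pays the full selection cost, so thousands of one-off shapes amortize nothing.

# Toy sketch of per-shape kernel caching (hypothetical; not ISAAC's actual code).
import time

_kernel_cache = {}

def select_best_kernel(M, N, K, dtype, layout):
    # Stand-in for the expensive input-aware selection step (prediction/benchmarking).
    time.sleep(0.01)
    return ("kernel-for", M, N, K, dtype, layout)

def get_gemm_kernel(M, N, K, dtype, layout):
    # Return the cached kernel for this shape, selecting one on first use.
    key = (M, N, K, dtype, layout)
    if key not in _kernel_cache:
        _kernel_cache[key] = select_best_kernel(M, N, K, dtype, layout)
    return _kernel_cache[key]

get_gemm_kernel(2048, 2048, 2048, "float32", "NT")  # first call: pays selection cost
get_gemm_kernel(2048, 2048, 2048, "float32", "NT")  # repeat call: cache hit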