triSYCL

1 News

2017/03/03: triSYCL can use CMake & ctest and works on Windows 10 with Visual Studio 2017. It also works with Ubuntu WSL on Windows. :-) More info
2017/01/12: Add test case using the Xilinx compiler for FPGA
2016/11/18: If you missed the free SYCL T-shirt on the Khronos booth during SC16, you can always buy some on https://teespring.com/khronos-hpc (lady's sizes available, so no excuse! :-) )
2016/08/12: OpenCL kernels can be run with OpenCL kernel interoperability mode now.
2016/04/18: SYCL 2.2 provisional specification is out.
This version implements SYCL 2.2 pipes and reservations plus the blocking pipe extension from Xilinx.

2 Table of contents

1 News
2 Table of contents
3 Introduction
4 OpenCL SYCL
5 OpenCL triSYCL code documentation
6 Installation
7 Examples and tests
- 7.1 Generating the Doxygen documentation
8 Possible futures

3 Introduction

triSYCL is an open source implementation to experiment with the specification of the OpenCL SYCL 1.2.1 and 2.2 C++ layer and to give feedback to the Khronos OpenCL SYCL and OpenCL C++ 2.2 kernel language committees.

This SYCL implementation is mainly based on C++1z (2017?) and OpenMP with execution on the CPU right now, but some parts of the non single-source OpenCL interoperability layer are implemented and the device compiler development is on-going for SPIR and SPIR-V. Since in SYCL there is a host fall-back, this CPU implementation can be seen as an implementation of this fall-back too...

The kernels can be executed in parallel on the CPU, with OpenMP parallelizing the first range dimension when compiled with OpenMP support, or on an OpenCL device through the interoperability mode (which is not single-source).
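As a rough illustration of what "parallel in the first range dimension" means, here is a minimal sketch in plain C++. It is not the triSYCL implementation: the names `parallel_for_2d` and `add_matrices` are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the triSYCL API: run kernel(i, j) over a
// rows x cols index space, sharing only the first dimension between threads.
template <typename Kernel>
void parallel_for_2d(std::size_t rows, std::size_t cols, Kernel kernel) {
  // Without -fopenmp the pragma is ignored and the loop runs sequentially.
  // A signed loop index keeps the loop acceptable to older OpenMP compilers.
#pragma omp parallel for
  for (long i = 0; i < static_cast<long>(rows); ++i)
    for (std::size_t j = 0; j < cols; ++j)
      kernel(static_cast<std::size_t>(i), j);
}

// Example use: c = a + b on row-major rows x cols matrices.
std::vector<int> add_matrices(const std::vector<int> &a,
                              const std::vector<int> &b,
                              std::size_t rows, std::size_t cols) {
  assert(a.size() == rows * cols && b.size() == rows * cols);
  std::vector<int> c(rows * cols);
  // Each (i, j) writes a distinct element, so the rows are independent.
  parallel_for_2d(rows, cols, [&](std::size_t i, std::size_t j) {
    c[i * cols + j] = a[i * cols + j] + b[i * cols + j];
  });
  return c;
}
```

The same source builds with or without OpenMP: only the outer loop changes from sequential to multithreaded.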

For legal reasons, the specification used for this open source project is the published current provisional specification and not the latest one currently discussed in the Khronos OpenCL SYCL committee. If you are a Khronos member, you can access https://gitlab.khronos.org/sycl/triSYCL where you might find more futuristic branches.

This is provided as is, without any warranty, with the same license as LLVM/Clang.

Technical lead: Ronan at keryell point FR. Developments started first at AMD and are now mainly funded by Xilinx.

4 OpenCL SYCL

OpenCL SYCL is a single-source C++14/C++17-based DSEL (Domain Specific Embedded Language) aimed at facilitating the programming of heterogeneous accelerators by leveraging the OpenCL language and concepts.

Note that even if the concepts behind SYCL are inspired by OpenCL concepts, the SYCL programming model is a very general asynchronous task graph model for heterogeneous computing with no relation to OpenCL itself, except when using the OpenCL API interoperability mode.
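To give a feel for an asynchronous task graph, here is a loose analogy in plain standard C++. It is not SYCL code: `run_task_graph` is a hypothetical name, and the dependencies are spelled out by hand with futures, whereas a SYCL runtime is expected to work them out for the programmer.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Plain C++ analogy: each std::async call plays the role of a task and each
// future plays the role of a data dependency between tasks.
int run_task_graph() {
  // Two independent producer tasks, free to run concurrently.
  std::future<std::vector<int>> a = std::async(std::launch::async, [] {
    std::vector<int> v(100);
    std::iota(v.begin(), v.end(), 0);  // 0, 1, ..., 99
    return v;
  });
  std::future<std::vector<int>> b = std::async(std::launch::async, [] {
    return std::vector<int>(100, 2);   // 100 elements equal to 2
  });

  // A consumer task depending on both producers: get() waits for each input.
  std::future<int> dot = std::async(
      std::launch::async, [fa = std::move(a), fb = std::move(b)]() mutable {
        const std::vector<int> va = fa.get();
        const std::vector<int> vb = fb.get();
        int total = 0;
        for (std::size_t i = 0; i < va.size(); ++i)
          total += va[i] * vb[i];
        return total;
      });

  // The host blocks only at the point where it needs the final result.
  return dot.get();
}
```

On some platforms this needs the thread library at link time (for example `-pthread` with GCC or Clang).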

OpenCL SYCL is developed inside the Khronos OpenCL SYCL committee and thus, for more information on SYCL, look at http://www.khronos.org/sycl

For the SYCL ecosystem, look at http://sycl.tech

4.1 Why you could use SYCL

SYCL has a lot of interesting advantages compared to plain OpenCL or other approaches:

SYCL is an open standard from Khronos with a working committee (you can contribute!) and we can expect several implementations (commercial or open source) on many platforms soon, ranging from GPU, APU, FPGA, DSP... down to plain CPU;
it offers a single-source C++ programming model that allows taking advantage of the modern C++14/C++17 superpower, unifying both the host and accelerator sides. For example it is possible to write generic accelerated functions on the accelerators in a terse way by using (variadic) templates, meta-programming and generic variadic lambda expressions. This allows building templated libraries such as Eigen or TensorFlow in a seamless way;
SYCL abstracts and leverages the concepts behind OpenCL and provides higher-level concepts such as tasks (or command group in OpenCL SYCL jargon) that allow the runtime to take advantage of a more task graph-oriented view of the computations. This allows lazy data transfers between accelerators and host or to use platform capabilities such as OpenCL 2 SVM or HSA for sharing data between host and accelerators;
the entry cost of the technology is zero since, after all, an existing OpenCL or C++ program is a valid SYCL program;
the exit cost is low since it is pure C++ without any extension or #pragma, as opposed to C++AMP or OpenMP for example. Retargeting the SYCL classes and functions to use other frameworks such as OpenMP 4 or C++AMP is feasible without writing a new compiler;
easier debugging
- since all memory accesses to array parameters in kernels go through accessors, all the memory bound checks can be done in them if needed;
- since there is a pure host mode, the kernel code can also be run on the host, debugged with the usual tools, and use any system library (such as <cstdio> or <iostream>...) or data library (for nice data visualization);
- since the kernel code is C++ code even when run on an accelerator, instrumenting the code by using special array classes or overloading some operators allows deep intrusive debugging or code analysis without changing the algorithmic parts of the code;
SYCL is high-level standard modern C++ without any extension, which means that you can use your usual compiler, and the host part can at the same time use some cool and common extensions such as OpenMP, OpenHMPP, OpenACC,... or libraries such as MPI or PGAS Coarray++, and be linked with other parts written in other languages (Fortran...). Thus SYCL is already Exascale-ready!
even if SYCL hides the OpenCL world by default, it inherits from all the OpenCL world:
- same interoperability as the OpenCL underlying platform: Vulkan, OpenGL, DirectX...
- access to all the underlying basic OpenCL objects behind the SYCL abstraction for interoperability and hard-core optimization;
- construction of SYCL objects from basic OpenCL objects to add some SYCL parts to an existing OpenCL application;
- so it provides a continuum from higher-level programming à la C++AMP or OpenMP 4 down to low-level OpenCL, according to the optimization needs, from using simple OpenCL intrinsics or vector operations from the cl::sycl namespace down to providing a real OpenCL kernel to be executed without requiring all the usual verbose OpenCL host API.
This seamless OpenCL integration plus the gradual optimization features are perhaps the most compelling arguments for SYCL because they allow high-level programming simplicity without giving up hard-core performance when needed;
since the SYCL task graph execution model is asynchronous, this can be used by side effect to overcome some underlying OpenCL implementation limitations. For example, some OpenCL stacks may have only in-order execution queues or even synchronous (blocking) ND-range enqueue, or some weird constrained mapping between OpenCL programmer level queue(s) and the hardware queues.

In this case, a SYCL implementation can deal with this, relying for example on multiple host CPU threads, multiple thread-local-storage (TLS) queues, its own scheduler, etc. atop the limited OpenCL stack to provide computation and communication overlap in a natural pain-free fashion. This relieves the programmer from reorganizing her application to work around these limitations, which can be quite cumbersome.

for introduction material on the interest of DSEL in this area, look for example at these articles:

Domain-specific Languages and Code Synthesis Using Haskell, Andy Gill. May 6, 2014 in ACM Queue and Communications of the ACM.
Design Exploration through Code-generating DSLs, Bo Joel Svensson, Mary Sheeran and Ryan Newton. May 15, 2014 in ACM Queue and Communications of the ACM.
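The debugging points in the list above (bound checks done in accessors, instrumentation through special array classes) can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: `checked_accessor` and `scale` are hypothetical names, not the cl::sycl::accessor class or any triSYCL API.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical debugging accessor: it wraps a buffer, checks every index and
// counts the accesses so they can be analyzed afterwards.
template <typename T>
class checked_accessor {
  std::vector<T> &data_;
  std::size_t accesses_ = 0;

public:
  explicit checked_accessor(std::vector<T> &data) : data_(data) {}

  // Every element access goes through operator[], so one check covers all.
  T &operator[](std::size_t i) {
    if (i >= data_.size())
      throw std::out_of_range("checked_accessor: index out of bounds");
    ++accesses_;
    return data_[i];
  }

  std::size_t size() const { return data_.size(); }
  std::size_t access_count() const { return accesses_; }
};

// The "kernel" is generic in its accessor type, so the algorithmic part is
// unchanged whether it receives a checked accessor or a plain container.
template <typename Accessor>
void scale(Accessor &acc, std::size_t n, int factor) {
  for (std::size_t i = 0; i < n; ++i)
    acc[i] *= factor;
}
```

Calling `scale` with a `checked_accessor<int>` and an `n` larger than the buffer throws `std::out_of_range` at the first bad index, while calling it with a plain `std::vector<int>` runs the same loop without any check.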

4.2 Some other implementations

Some other known implementations:

Codeplay has a ComputeCpp product implementing SYCL based on OpenCL SPIR with Clang/LLVM https://www.codeplay.com/products/computesuite/computecpp
SYCL-GTX https://github.com/ProGTX/sycl-gtx

4.3 Some presentations and publications related to SYCL

By reverse chronological order:

Post-modern C++17 abstractions for heterogeneous computing with Khronos OpenCL SYCL. Ronan Keryell. Paris C++ User Group Meetup, Paris, France. January 19, 2017.
Khronos Group SYCL standard --- triSYCL Open Source Implementation, Ronan Keryell (Xilinx & Khronos OpenCL SYCL Working Group Member). November, 2016, Presentation at SuperComputing 2016, Salt Lake City, USA.
P0367R0: Accessors — wrapper classes to qualify accesses, Ronan Keryell (Xilinx) & Joël Falcou (NumScale). November, 2016, Presentation at ISO C++ committee, Issaquah, WA, USA.
Experiments with triSYCL: poor (wo)man shared virtual memory. Ronan Keryell. SYCL 2016 - 1st SYCL Programming Workshop, collocated with PPoPP'16, Barcelona, Spain. March 13, 2016.
Khronos's OpenCL SYCL to support Heterogeneous Devices for C++. Proposal for the C++ committee SG14 in Jacksonville, Florida, USA February 12, 2016.
SYCL presentation at SG14 C++ committee teleconference, Andrew Richards (CEO Codeplay & Chair SYCL Working group). February 3, 2016.
Post-modern C++ abstractions for FPGA & heterogeneous computing with OpenCL SYCL & SPIR-V, Ronan Keryell. ANL REFORM 2016: Workshop on FPGAs for scientific simulation and data analytics, Argonne National Laboratory. January 22, 2016.
From modern FPGA to high-level post-modern C++ abstractions for heterogeneous computing with OpenCL SYCL & SPIR-V, Ronan Keryell. HiPEAC WRC 2016: Workshop on Reconfigurable Computing. Prague, January 19, 2016.
HiPEAC 2016 tutorial on SYCL: Khronos SYCL for OpenCL. HiPEAC 2016, Prague, January 18, 2016.
A Tutorial on Khronos SYCL for OpenCL at IWOCL 2015. Stanford, May 12, 2015.
Modern C++, OpenCL SYCL & OpenCL CL2.hpp, Ronan Keryell (AMD & Khronos OpenCL SYCL Working Group Member). November 18, 2014, Presentation at SuperComputing 2014, OpenCL BoF, New Orleans, USA.
Implementing the OpenCL SYCL Shared Source C++ Programming Model using Clang/LLVM, Gordon Brown. November 17, 2014, Workshop on the LLVM Compiler Infrastructure in HPC, SuperComputing 2014
SYCL Specification --- SYCL integrates OpenCL devices with modern C++, Khronos OpenCL Working Group — SYCL subgroup. Editors: Lee Howes and Maria Rovatsou. Version 1.2, Revision 2014-09-16.
OpenCL 2.0, OpenCL SYCL & OpenMP 4, open standards for heterogeneous parallel programming, Ronan Keryell (AMD & Khronos OpenCL Working Group Member). July 3, 2014, Presentation at the Meetup of the High Performance Computing & Supercomputing Group of Paris.
OpenCL 2.0, OpenCL SYCL & OpenMP 4, open standards for heterogeneous parallel programming, Ronan Keryell (AMD & Khronos OpenCL Working Group Member). July 2, 2014, Presentation at Forum Ter@tec: Calcul scientifique & Open Source : pratiques industrielles des logiciels libres.
The Future of Accelerator Programming in C++, Sebastian Schaetz, May 18, 2014. Presentation at C++Now 2014.
SYCL : Abstraction Layer for Leveraging C++ and OpenCL, Maria Rovatsou (Codeplay & Khronos OpenCL Working Group Member). May 12-13, 2014, IWOCL 2014.
Building the OpenCL ecosystem - SYCL for OpenCL, Lee Howes (Senior Staff Engineer at Qualcomm & Khronos OpenCL Working Group Member). April 21, 2014, HPC & GPU Supercomputing Group of Silicon Valley.
SYCL 1.2: Unofficial High-Level Overview, AJ Guillon (Khronos OpenCL Working Group Member). March 19, 2014. Video.
SYCL for OpenCL, Andrew Richards (CEO Codeplay & Chair SYCL Working group). March 19, 2014, GDC 2014.
Fusing GPU kernels within a novel single-source C++ API, Ralph Potter, Paul Keir, Jan Lucas, Mauricio Alvarez-Mesa, Ben Juurlink and Andrew Richards. January 20, 2014, LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM 2014).
Fusing GPU kernels within a novel single-source C++ API, Ralph Potter, Paul Keir, Jan Lucas, Mauricio Alvarez-Mesa, Ben Juurlink, Andrew Richards. November 18, 2013, Intel Compiler, Architecture and Tools Conference.

There are also many interesting articles in the publication list from Codeplay.

4.4 Related projects

CLHPP: The OpenCL C++ wrapper from Khronos around host API
Boost.Compute
VexCL
ViennaCL
ISO/IEC JTC1/SC22/WG21 C++ committee
- the SG14 subgroup on low latency, real time requirements, performance, efficiency, heterogeneous computing, where SYCL is one of the candidates;
- C++ Parallelism TS https://github.com/cplusplus/parallelism-ts
  
  SYCL Parallel STL is an implementation of the Parallel STL of C++17 based on SYCL;
- C++ Concurrency TS https://github.com/cplusplus/concurrency_ts
OpenMP is a #pragma-based standard to express different kinds of parallelism, with accelerators supported since version 4.0;
OpenACC is a #pragma-based extension targeting accelerators;
Bolt
Thrust
C++AMP
HCC https://bitbucket.org/multicoreware/hcc/wiki/Home
GOOPAX is a product providing a C++11 framework for single-source OpenCL;
PACXX is a higher-level C++ compiler and framework for accelerators;
Intel SPMD Program Compiler https://ispc.github.io/
Intel Labs' iHRC https://github.com/IntelLabs/iHRC
CUDA
Metal

5 OpenCL triSYCL code documentation

The documentation of the triSYCL implementation itself can be found in http://xilinx.github.io/triSYCL/Doxygen/triSYCL/html and http://xilinx.github.io/triSYCL/Doxygen/triSYCL/triSYCL-implementation-refman.pdf

6 Installation

Only Clang 3.9+ or GCC 5.4+, Boost.MultiArray (which adds to C++ the nice Fortran array semantics and syntax), Boost.Operators and a few other Boost libraries are needed.

To install them on the latest Debian/unstable Linux (this should work on the latest Ubuntu too, just adapt the compiler versions):

sudo apt-get install clang-3.9 g++-6 libboost-dev

For now, nothing else is needed to use the include files from the triSYCL include directory when compiling a program. Just add a -I.../include option and -std=c++1y when compiling.

triSYCL is configurable through preprocessor macros described in macros.

Also use -fopenmp if you want to use multicore parallelism on the CPU.

The CMake support is described in doc/cmake.rst.

7 Examples and tests

There are simple examples and tests in the tests directory. See tests/README.rst for a description.

7.1 Generating the Doxygen documentation

In the top directory, run

make

that will produce tmp/Doxygen/SYCL with the API documentation and tmp/Doxygen/triSYCL with the documented triSYCL implementation source code.

To publish the documentation on GitHub:

make publish

and finish as explained by the make output.

8 Possible futures

Some ideas for future developments to which you can contribute too: :-)

finish implementation of basic classes without any OpenCL support;
move to CMake for better portability (status: Lee Howes has made it on 1 of his private branches. To be merged...);
improve the test infrastructure (for example move to something more standard with Boost.Test. Status: started);
use the official OpenCL SYCL test suite to extend/debug/validate this implementation;
add vector swizzle support by following ideas from https://github.com/gwiazdorrr/CxxSwizzle http://glm.g-truc.net http://jojendersie.de/performance-optimal-vector-swizzling-in-c http://www.reedbeta.com/blog/2013/12/28/on-vector-math-libraries ;
add first OpenCL support with kernels provided only as strings, thus avoiding the need for a compiler. Could be based on other libraries such as Boost.Compute, VexCL, ViennaCL... (status: started with Boost.Compute);
make an accelerator version based on OpenMP 4 accelerator target, OpenHMPP or OpenACC;
make an accelerator version based on wrapper classes for the C++AMP Open Source compiler.

Extend the current C++AMP OpenCL HSA or SPIR back-end runtime to expose OpenCL objects needed for the SYCL OpenCL interoperability. This is probably the simpler approach to have a running SYCL compiler working quickly.

The main issue is that since C++AMP support is not yet integrated in the official trunk, it would take a long time to break things down and be reviewed by the Clang/LLVM community. Actually, since Microsoft is no longer pushing this project and there are some design issues in the language requiring a lot of change to the C++ parser, it will probably never be up-streamed to Clang/LLVM;
extend runtime and Clang/LLVM to generate OpenCL/SPIR from C++ single-source kernels, by using OpenMP outliner. Starting from an open source OpenCL C/C++ compiler sounds great;
alternatively develop a Clang/LLVM-based version, recycling the outliner which is already present for OpenMP support and modify it to generate SPIR. Then build a specific version of libiomp5 to use the OpenCL C/C++ API to run the offloaded kernels. See https://drive.google.com/file/d/0B-jX56_FbGKRM21sYlNYVnB4eFk/view and the projects https://github.com/clang-omp/libomptarget for https://github.com/clang-omp/llvm_trunk and https://github.com/clang-omp/clang_trunk

This approach may require more work than the C++AMP version, but since it is based on the existing OpenMP infrastructure that Intel spent a lot of time upstreaming through the official code review process, in the end it would require much less time for up-streaming, if this is the goal.

OpenMP 4 in Clang/LLVM is gaining momentum and making a lot of progress, backed by Intel, IBM, AMD... so it sounds like a way to go;
recycle the GCC https://gcc.gnu.org/wiki/Offloading OpenMP/OpenACC library infrastructure to construct an OpenCL interoperability API and adapt the triSYCL classes to leverage OpenMP/OpenACC;
add OpenCL 2.x support with SYCL 2.x;
since SYCL is a pretty general programming model for heterogeneous computing, if the OpenCL compatibility layer is not required, some other back-ends could be written besides the current OpenMP one: CUDA, RenderScript, etc.
SYCL concepts (well, classes) can also be ported to some other languages to provide heterogeneous support: SYJSCL, SYCamlCL, SYJavaCL... It is not clear yet if SYFortranCL is possible with Fortran 2008 or 2015+.
