cython-sse-example

Simple example for embedding SSE2 assembly in Cython projects

Introduction

As it says on the tin. The embedding is done using the SIMD intrinsics.

The purpose of this project is to provide documentation, since this information was a bit hard to find.

Be aware that in the case of a simple loop over data, performing the same manipulation on each array element, GCC with the compiler flags -march=native -mfpmath=sse -msse -msse2 -O2 is pretty good at generating SSE code. Manually tuned SSE for such trivial tasks might actually be slower (and less maintainable, and specific to a CPU architecture) than just writing (sensible) plain C and letting GCC take care of the performance optimization.

Only more advanced algorithms designed specifically for low-level vectorization are likely to see significant (or indeed any) benefit.
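For illustration, here is a minimal sketch of the kind of simple loop meant above. The function name `scale_and_shift` is hypothetical and not part of this project. Compiling it with the flags listed above and inspecting the output of `gcc -S` shows what the compiler generates. Depending on the GCC version, packed (vectorized) instructions may additionally require `-O3` or `-ftree-vectorize`.

```c
#include <stddef.h>

/* Plain C, no intrinsics: apply the same operation to every element.
   Hypothetical example function, not taken from sse_demo.pyx. */
void scale_and_shift(double *data, size_t n, double scale, double offset)
{
    for (size_t i = 0; i < n; ++i) {
        /* Independent per-element work; a good candidate for
           the compiler's own SSE code generation. */
        data[i] = data[i] * scale + offset;
    }
}
```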

If you still wish to access SSE manually, read on.

The __m128d datatype

Defining the __m128d datatype in Cython is the tricky part. The answer can be found in this thread on cython-users (search terms used: "cython typedef sse").

Cython needs to be told, within the constraints of its syntax, that __m128d behaves like a double.

Then, the C compiler needs the exact definition. To get the Cython-generated C code to #include it, the ctypedef must be placed inside a cdef extern from block that names the original header (emmintrin.h for SSE2).

See sse_demo.pyx.

Further reading

General notes on SSE in Cython can be found in this cython-users thread (search terms: "cython sse").

If the SSE part can be isolated in its own C file, there is another approach that can be used.
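As a rough sketch of what such an isolated C file might contain (not necessarily the approach referred to above), the function below adds two arrays of doubles using SSE2 intrinsics. The name `add_arrays_sse2` is hypothetical. The intrinsics `_mm_loadu_pd`, `_mm_add_pd` and `_mm_storeu_pd` come from `emmintrin.h`, and the unaligned load/store variants are used on the assumption that the caller makes no alignment guarantees.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics; defines __m128d */

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
void add_arrays_sse2(const double *a, const double *b, double *out, size_t n)
{
    size_t i = 0;

    /* Main loop: one __m128d holds two doubles. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);  /* unaligned load of 2 doubles */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }

    /* Scalar tail: handles the last element when n is odd. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

With this layout, only plain `double*` arguments cross the boundary, so the Cython side would only need to declare the function itself and would not have to know about `__m128d` at all.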

On SSE in general:

License

The Unlicense