AuburnSounds/intel-intrinsics

Idea: UDAs to automatically do inlining with LDC

Opened this issue · 6 comments

It is cumbersome / problematic to have to pass --enable-cross-module-inlining to LDC to get good performance; this may not be an option in a larger project.
Perhaps a UDA on each function could be used to force inlining of functions that are known to optimize to a single instruction (or a few)?

Something like:

static if (SSE2_with_LDC) { 
  import ldc.attributes;
  enum SSE2_inline = llvmAttr("alwaysinline");
} else static if (SSE2_with_GDC) { 
  // GDC probably has some UDA for this
} else {
  alias SSE2_inline = void; // Don't force inlining for emulated cases.
}

@SSE2_inline
void _mm_some_SSE2_intrinsic(){}

If this does not help with cross-module inlining, then perhaps turning every function into a template (with zero arguments) will.
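A minimal sketch of the zero-argument template idea, for illustration only. The name `_mm_example_add` is hypothetical and not part of intel-intrinsics; the empty first parameter list is what turns the function into a template, so it is instantiated in the module that calls it. Whether this is enough to get cross-module inlining in practice is the open question raised above.

```d
// Hypothetical intrinsic-like function written as a zero-argument template.
// The empty `()` before the runtime parameters makes this a function template.
int _mm_example_add()(int a, int b) pure @safe
{
    return a + b;
}

unittest
{
    // Called like an ordinary function; the template is instantiated implicitly.
    assert(_mm_example_add(2, 3) == 5);
}
```

Call sites do not change, which is why this conversion can be applied to existing functions without touching user code.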

p0nce commented

Can pragma(inline, bool) be used for the same effect? (Some of the most used intrinsics are already marked pragma(inline, true).)

pragma(inline, ...) will work, but I don't know how to apply it conditionally. (And it is better to put it outside the function than inside, because then perhaps we force codegen..?) If the function is a template, then semantic analysis will happen and the pragma can be put inside as well (that is the easiest, I suppose...)

p0nce commented

When it was introduced pragma(inline) was supposed to take a compile-time parameter as argument (to inline or not) and I've never heard that you could put it before the function instead of inside.

I'd prefer pragma(inline) inside; that way you can choose for which compiler and instruction set you want to inline the function (typically this exposed DMD bugs as soon as pragma(inline) was introduced). Also, the LDC_with_arm64 path could be very long while the LDC_with_SSE42 path could be very short, and they will need different inlining choices. So I'm not a fan of the UDA solution.
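A rough sketch of a per-compiler inlining choice with the pragma inside the function, assuming the pragma accepts a compile-time boolean as described above. The names `forceInlineExample` and `_mm_example_abs` are hypothetical, and the `version (LDC)` split only stands in for finer-grained conditions such as LDC_with_SSE42.

```d
// Hypothetical compile-time switch: pick the inlining policy per compiler.
version (LDC)
{
    enum bool forceInlineExample = true;  // assumed short code path, worth inlining
}
else
{
    enum bool forceInlineExample = false; // assumed longer path, keep out of line
}

int _mm_example_abs(int a) pure @safe
{
    // The argument is evaluated at compile time.
    pragma(inline, forceInlineExample);
    return a < 0 ? -a : a;
}

unittest
{
    assert(_mm_example_abs(-4) == 4);
    assert(_mm_example_abs(7) == 7);
}
```

Note that a boolean only selects between forced inline and forced not-inline; it has no value that leaves the decision to the compiler.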

p0nce commented

"and it is better to put it on the outside than on inside, because then perhaps we force codegen..?"

I don't get this.

If the function is not a template and is called from another module, the compiler will not do semantic analysis of the function body and so will not see a pragma(inline) placed inside. In general I therefore recommend putting it outside the function (very often it is wrongly applied inside and has no effect). The problem with the pragma is that you cannot make it default to nothing: it is either a forced inline or a forced not-inline. Perhaps all intrinsics should be templates, but due to template culling I'm not sure whether codegen will happen. At least semantic analysis is probably always happening, i.e. the compiler will see the pragma(inline) inside. (I'm not 100% sure)
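To make the two placements concrete, here is a small sketch with hypothetical function names. It only shows the syntax of each form; whether the inside form is actually honoured for calls from another module is the uncertain part discussed above.

```d
// Form 1: pragma on the outside, attached to the declaration itself,
// so it is visible without analysing the function body.
pragma(inline, true)
int exampleOutside(int a) pure @safe
{
    return a + 1;
}

// Form 2: pragma on the inside, here combined with a zero-argument template
// so that the body is analysed wherever the function is instantiated.
int exampleInside()(int a) pure @safe
{
    pragma(inline, true);
    return a + 1;
}

unittest
{
    assert(exampleOutside(1) == 2);
    assert(exampleInside(1) == 2);
}
```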

p0nce commented

In general my feeling about intel-intrinsics (vs vanilla code) currently is, from worst problem to least problem:

  1. slower in debug (it builds slower) with LDC. Tricks to improve this are welcome!
  2. slower in debug (it runs slower) with LDC. Though sometimes you can also win vs vanilla even without optimizations.
  3. other problems, such as avoiding DMD codegen problems

If forced inlining makes (2) better, it should not make (1) worse in the process.

If you have ideas to make intrinsics code build faster, they are welcome (through inlining, templating, or otherwise).
Some code is deliberately templated so that it is not generated unless it is used (stricmp emulation).