AuburnSounds/intel-intrinsics

Bench pragma(inline) on real project (debug performance)

Closed this issue · 6 comments

p0nce commented

Regarding debug performance: bench how much pragma(inline) would slow down compilation, and how much it would benefit run-time performance. Right now the intrinsics are really quite slow in debug mode, and worse than assembly.

p0nce commented

Applying it at top level exposes LDC bugs, so applying it only to the most common intrinsics is probably a better idea.

p0nce commented

Basic count of occurrences across a few SIMD codebases:

_mm_loadu_ps xxxxxxxxxxxxxxxxx
_mm_set1_ps xxxxxxxxxxxxxx
_mm_mulhi_epu16 xxxxxxxxx

_mm_sub_ps xxxxxxxx

_mm_add_ps xxxxxxx
_mm_setr_pd xxxxxxx
_mm_storeu_ps xxxxxxx

_mm_add_pd xxxxx
_mm_unpacklo_epi8 xxxxx
_mm_cvtepi32_ps xxxxx
_mm_packs_epi32 xxxxx
_mm_load_ps xxxxx
_mm_mul_ps xxxxx

_mm_add_epi32 xxxx
_mm_slli_si128 xxxx
_mm_unpackhi_epi8 xxxx
_mm_setzero_si128 xxxx
_mm_and_ps xxxx
_mm_storeu_si128 xxxx
_mm_packus_epi16 xxxx
_mm_loadu_si128 xxxx

_mm_set1_epi32 xxx
_mm_slli_epi16 xxx
_mm_shuffle_pd xxx
_mm_setzero_ps xxx
_mm_cmple_ps xxx

_mm_setr_ps xx
_mm_srli_ps xx
_mm_cvtps_pd xx
_mm_loadu_pd xx
_mm_mul_pd xx
_mm_cvttps_epi32 xx
_mm_set1_pd xx
_mm_srli_epi16 xx
_mm_setr_epi32 xx
_mm_set1_epi16 xx
_mm_xor_si128 xx
_mm_add_epi16 xx
_mm_sub_epi16 xx
_mm_subs_epi16 xx
_mm_adds_epi16 xx

_mm_storeu_pd x
_mm_shuffle_ps x
_mm_storel_epi64 x
_mm_loadl_epi64 x
_mm_cmplt_ps x
_mm_movemask_ps x
_mm_unpacklo_epi32 x
_mm_cmpeq_epi32 x
_mm_storeu_si32 x
_mm_store_ps x

_mm_max_ps x
_mm_shuffle_epi32 x
_mm_min_ps x
_mm_cvtps_pd x
_mm_sub_pd x
_mm_unpacklo_epi64 x
_mm_loadu_si32 x
_mm_unpacklo_epi16 x
_mm_unpackhi_epi32 x
_mm_adds_epu8 x
_mm_srai_epi32 x
_mm_srai_epi16 x
_mm_and_si128 x
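
A count like the one above can be reproduced with a small script. The following is only a sketch in Python, not the tool actually used for this issue: the function names `count_intrinsics` and `histogram` are hypothetical, and it assumes one `x` per occurrence, which may not match the scale of the listing above.

```python
import re
import sys
from collections import Counter

# Matches SSE-style intrinsic names such as _mm_add_ps or _mm_loadu_si128.
INTRINSIC_RE = re.compile(r"\b_mm_\w+")


def count_intrinsics(sources):
    """Count how often each _mm_* identifier appears across source strings."""
    counts = Counter()
    for text in sources:
        counts.update(INTRINSIC_RE.findall(text))
    return counts


def histogram(counts):
    """Render one 'name xxx' line per intrinsic, most frequent first."""
    return [f"{name} {'x' * n}" for name, n in counts.most_common()]


if __name__ == "__main__":
    # Usage (hypothetical): python count_intrinsics.py source/*.d
    texts = []
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as f:
            texts.append(f.read())
    print("\n".join(histogram(count_intrinsics(texts))))
```

The intrinsics at the top of such a listing are the natural first candidates for forced inlining.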

p0nce commented

A basic test gives almost no build overhead and a 6kb larger binary, for a +2.6 speedup.

The set of intrinsics to force inline:

  • add, sub, mul, div for float, double, epi8, epi16, epi32
  • basic load, loadu, store and storeu intrinsics
  • set1
  • setr
  • set
  • setzero
  • _mm_undefined_xxxx
  • and / or / xor intrinsics
    Do this carefully and one by one.

p0nce commented

The problem to solve is that running CPU-intensive software built with intel-intrinsics in debug mode is annoyingly slow.

Some pragma(inline, true) are enabled in v1.6.1+

Results of comparing DMD and LDC, before and after pragma(inline):

Build times:

  • LDC without pragma(inline) => 17 secs
  • LDC with pragma(inline) => 17 secs
  • DMD without pragma(inline) => 3 secs
  • DMD with pragma(inline) => 3 secs

=> DMD builds the debug build 5.6x quicker than LDC.

Run Times:

  • LDC without pragma(inline) => baseline
  • LDC with pragma(inline) => +5.1% quicker than baseline
  • DMD without pragma(inline) => 66% slower than baseline
  • DMD with pragma(inline) => 67% slower than baseline, some

tl;dr LDC debug builds go 5% faster with forced inlining. Forced inlining does nothing, or is slightly harmful, for DMD.
At run time, something built with LDC is up to three times quicker than the same thing built with DMD.

Binary Size:

  • LDC without pragma(inline) => 3080kb
  • LDC with pragma(inline) => 3089kb
  • DMD without pragma(inline) => 6851kb
  • DMD with pragma(inline) => 6870kb
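
The ratios implied by these figures can be checked with a few lines of arithmetic. This Python sketch uses only the numbers reported above; the variable and function names are hypothetical. The comment does not say which convention "66% slower" follows, so both readings are computed, and only the second is consistent with "up to three times quicker".

```python
# Figures copied from the comment above (build times in seconds, sizes in kb).
build_secs = {"LDC": 17, "DMD": 3}
size_kb = {
    "LDC": {"without": 3080, "with": 3089},
    "DMD": {"without": 6851, "with": 6870},
}


def size_growth(sizes):
    """Return (absolute growth in kb, relative growth in percent)."""
    delta = sizes["with"] - sizes["without"]
    return delta, 100.0 * delta / sizes["without"]


# Whole-second timings give about 5.7x; the 5.6x quoted in the comment
# presumably comes from unrounded timings (an assumption).
build_ratio = build_secs["LDC"] / build_secs["DMD"]

# Two possible readings of "DMD is 66% slower than the LDC baseline".
slowdown = 0.66
ratio_if_more_time = 1.0 + slowdown            # DMD takes 66% more time
ratio_if_less_speed = 1.0 / (1.0 - slowdown)   # DMD runs at 34% of LDC's speed

if __name__ == "__main__":
    for compiler, sizes in size_kb.items():
        delta, pct = size_growth(sizes)
        print(f"{compiler}: +{delta}kb ({pct:.2f}%) with pragma(inline)")
    print(f"LDC/DMD build-time ratio: {build_ratio:.2f}x")
    print(f"LDC speed advantage, reading 1: {ratio_if_more_time:.2f}x")
    print(f"LDC speed advantage, reading 2: {ratio_if_less_speed:.2f}x")
```

Either way, the binary-size cost of forced inlining stays under 0.3% for both compilers.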

p0nce commented

Well, to have those inlined in VisualD you need --combined.
Using DMD -O -inline is not competitive with an LDC debug build.
asm is not that bad if too much time is spent in debug mode...

p0nce commented

Probably exposed one DMD bug. Avoid pragma(inline) on DMD I guess.