Bench pragma(inline) on real project (debug performance)

Question

Closed this issue 3 years ago · 6 comments

Benchmark how much pragma(inline) would slow down compilation and how much it would benefit run-time performance in debug builds. Intrinsics are currently quite slow in debug mode, worse than assembly.

Answer 1 · 2021-10-22T13:21:50.000Z

Applying it at top level exposes LDC bugs, so applying it only to the most common intrinsics is probably a better idea.

Answer 2 · 2021-10-22T13:46:39.000Z

Basic count of occurrences in a few SIMD codebases:

_mm_loadu_ps xxxxxxxxxxxxxxxxx
_mm_set1_ps xxxxxxxxxxxxxx
_mm_mulhi_epu16 xxxxxxxxx

_mm_sub_ps xxxxxxxx

_mm_add_ps xxxxxxx
_mm_setr_pd xxxxxxx
_mm_storeu_ps xxxxxxx

_mm_add_pd xxxxx
_mm_unpacklo_epi8 xxxxx
_mm_cvtepi32_ps xxxxx
_mm_packs_epi32 xxxxx
_mm_load_ps xxxxx
_mm_mul_ps xxxxx

_mm_add_epi32 xxxx
_mm_slli_si128 xxxx
_mm_unpackhi_epi8 xxxx
_mm_setzero_si128 xxxx
_mm_and_ps xxxx
_mm_storeu_si128 xxxx
_mm_packus_epi16 xxxx
_mm_loadu_si128 xxxx

_mm_set1_epi32 xxx
_mm_slli_epi16 xxx
_mm_shuffle_pd xxx
_mm_setzero_ps xxx
_mm_cmple_ps xxx

_mm_setr_ps xx
_mm_srli_ps xx
_mm_cvtps_pd xx
_mm_loadu_pd xx
_mm_mul_pd xx
_mm_cvttps_epi32 xx
_mm_set1_pd xx
_mm_srli_epi16 xx
_mm_setr_epi32 xx
_mm_set1_epi16 xx
_mm_xor_si128 xx
_mm_add_epi16 xx
_mm_sub_epi16 xx
_mm_subs_epi16 xx
_mm_adds_epi16 xx

_mm_storeu_pd x
_mm_shuffle_ps x
_mm_storel_epi64 x
_mm_loadl_epi64 x
_mm_cmplt_ps x
_mm_movemask_ps x
_mm_unpacklo_epi32 x
_mm_cmpeq_epi32 x
_mm_storeu_si32 x
_mm_store_ps x

_mm_max_ps x
_mm_shuffle_epi32 x
_mm_min_ps x
_mm_cvtps_pd x
_mm_sub_pd x
_mm_unpacklo_epi64 x
_mm_loadu_si32 x
_mm_unpacklo_epi16 x
_mm_unpackhi_epi32 x
_mm_adds_epu8 x
_mm_srai_epi32 x
_mm_srai_epi16 x
_mm_and_si128 x
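
A count like the one above can be reproduced with a small script. This is only a sketch, not the tool actually used for the list: the regular expression, the function names `count_intrinsics` and `histogram`, and the sample snippet are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: intrinsics are recognised purely by their _mm_ prefix.
INTRINSIC_RE = re.compile(r"\b_mm_[a-z0-9_]+\b")

def count_intrinsics(source_text):
    """Return a Counter mapping each intrinsic name to its occurrence count."""
    return Counter(INTRINSIC_RE.findall(source_text))

def histogram(counts):
    """Render one 'name xxx' line per intrinsic, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{name} {'x' * n}" for name, n in ordered)

# Hypothetical source snippet standing in for a real project file.
sample = """
__m128 a = _mm_loadu_ps(p);
__m128 b = _mm_loadu_ps(q);
_mm_storeu_ps(r, _mm_add_ps(a, b));
"""

print(histogram(count_intrinsics(sample)))
```

For a real project, the text of every source file would be concatenated (or the counters summed) before rendering the histogram.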

Answer 3 · 2021-10-22T13:59:14.000Z

A basic test shows almost no build overhead and a 6 kB larger binary, for a +2.6 speedup.

The set of intrinsics to force inline:

Answer 4 · 2021-10-22T15:20:14.000Z

The problem to solve is that running CPU-intensive software built with intel-intrinsics in debug mode is annoyingly slow.

Some pragma(inline, true) annotations are enabled in v1.6.1+.

Results of comparing DMD and LDC, before and after pragma(inline):

Build times:

LDC without pragma(inline) => 17 secs
LDC with pragma(inline) => 17 secs
DMD without pragma(inline) => 3 secs
DMD with pragma(inline) => 3 secs

=> DMD builds the debug build 5.6x quicker than LDC.

Run Times:

LDC without pragma(inline) => baseline
LDC with pragma(inline) => +5.1% quicker than baseline
DMD without pragma(inline) => 66% slower than baseline
DMD with pragma(inline) => 67% slower than baseline, some

tl;dr: LDC debug builds run 5% faster. Forced inlining does nothing, or is slightly harmful, for DMD.
At run time, code built with LDC is up to three times quicker.

Binary Size:

LDC without pragma(inline) => 3080kb
LDC with pragma(inline) => 3089kb
DMD without pragma(inline) => 6851kb
DMD with pragma(inline) => 6870kb
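
As a sanity check, the ratios quoted above can be recomputed from the listed figures. This sketch assumes that "N% slower" means the program runs at (100 - N)% of the baseline speed, which is the reading that matches "up to three times quicker". The helper names are made up for illustration.

```python
def build_speed_ratio(slow_secs, fast_secs):
    """How many times quicker the fast build is than the slow one."""
    return slow_secs / fast_secs

def speed_ratio_from_percent_slower(percent_slower):
    """Assumption: 'N% slower' means speed is (100 - N)% of baseline."""
    return 100.0 / (100.0 - percent_slower)

# Build times: LDC 17 secs vs DMD 3 secs (rounded figures from the list).
print(round(build_speed_ratio(17, 3), 2))             # 5.67

# Run times: DMD is 66% to 67% slower than the LDC baseline.
print(round(speed_ratio_from_percent_slower(66), 2))  # 2.94
print(round(speed_ratio_from_percent_slower(67), 2))  # 3.03

# Binary size growth from pragma(inline), in kb.
print(3089 - 3080, 6870 - 6851)                       # 9 19
```

The rounded build times give about 5.67x rather than the quoted 5.6x, so the original figure presumably came from unrounded measurements.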

Answer 5 · 2021-10-22T15:57:46.000Z

To have those inlined in VisualD, you need --combined.
DMD with -O -inline is not competitive with an LDC debug build.
asm is not that bad if too much time is spent in debug mode...

Answer 6 · 2021-11-03T15:06:12.000Z

This probably exposed one DMD bug. Avoid pragma(inline) on DMD, I guess.