geofflangdale/simdcsv

bad msvc performance

Opened this issue · 10 comments

I've proposed a PR (#10) that allows to compile on MSVC.

However the performance is awful (0.65GB/sec instead of 11GB/sec on the same machine, with MSVC Clang)

If anybody is able to look into that.

Side note: it has nothing to do with __builtin_prefetch. MSVC Clang accepts it, does not change results.

Incidental findings: defining
#define really_inline __declspec(noinline)
MSVC perfomance jumps to 5.6GB/sec. Much better.

The use case for __declspec(noinline) are functions that are small (so likely inlined) but rarely used. It does not make intuitive sense that it would help performance.

In simdjson, we have

 #define simdjson_really_inline __forceinline

In simdcsv, we have...

#ifdef _MSC_VER


#define really_inline inline
#define never_inline __declspec(noinline)

Yes, discovered this by chance. MSVC results are counter-intuitive, something's wrong in their inlining.

__forceinline and inline give same results.

I tried noinline to get a split of function time while profiling...

it is find_quote_mask + cmp_mask_against_input that I need to un-inline

something's wrong in their inlining.

Though we have not worked much on simdcsv, we have worked much with simdjson and Visual Studio and I have found that, indeed, something is very wrong with the code generation. At first, we thought that Visual Studio would ignore use and not inline, but it does.

The one strange thing I have spotted in the generated code is that Visual Studio seems to sometimes get confused and copy all the registers to stack and then immediately bring them back from stack. I reported the issue to Microsoft and we even chatted with some of the people from the core Visual Studio compiler team. Unfortunately, I do not think that they investigated the issue, and I ended up giving up and moving on (after reporting the issue in details).

You can see that I did a lot of work...

simdjson/simdjson#847 (comment)

You can try manually inlining the functions... it should work. To be clear, manually inlining the functions give better results then relying on the compiler, under visual studio.

@reder2000 Are you able to look at the produce assembly? It could be fun to compare before/after inlining. Just do an assembly dump. I predict that you will see the registering/deregistering issue.

Note that if you use ClangCL, the problem goes away.

References

simdjson/simdjson#847 (comment)

it is find_quote_mask + cmp_mask_against_input that I need to un-inline

Or try manually inlining them!

I've been playing a while with it. It turns out that MSVC performance is worst with /Ob2 than /Ob1.
With /Ob1 performance is 50% of Clang-CL at /Ob2 (and same as CLang-CL at /Ob1)
I tried replacing function with macros, one gets the /Ob2 (really bad) performance whatever the flag, so too much inlining is causing MSVC to act weirdly.
Those people seem to have the same kind of problems: xtensor-stack/xsimd#617

Those people seem to have the same kind of problems: xtensor-stack/xsimd#617

I think their problem is too little inlining.

Maybe I read that too quickly.

I'll submit the issue to MSFT. /Ob2 worst than /Ob1 is not good news. Even though this is a corner case.