cldellow/sqlite-parquet-vtable

consider profile-guided optimizations

cldellow opened this issue · 4 comments

There are lots of dumb performance sinkholes. For example, inlining currentRowSatisfiesFilter speeds up the code by 10% in some cases. There's probably no reason not to inline it, since it was broken out only for readability and has only one caller. Still, finding these by hand will suck.
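To illustrate the kind of manual fix meant here, a minimal sketch of forcing a single-caller helper inline with GCC. The names (Constraint, rowSatisfiesFilter, countMatches) are hypothetical stand-ins, not the repo's actual code, and the always_inline attribute is GCC/Clang-specific.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a filter constraint; not the repo's real type.
struct Constraint {
  std::int64_t value;
};

// The helper stays separate for readability, but the attribute asks
// GCC/Clang to inline it even if the usual heuristics would decline.
__attribute__((always_inline)) inline bool rowSatisfiesFilter(
    std::int64_t cell, const Constraint& c) {
  return cell == c.value;
}

// The only caller: a scan that counts the rows passing the filter.
std::size_t countMatches(const std::vector<std::int64_t>& column,
                         const Constraint& c) {
  std::size_t matches = 0;
  for (std::int64_t cell : column) {
    if (rowSatisfiesFilter(cell, c)) {
      ++matches;
    }
  }
  return matches;
}
```

This works, but it means spotting every such helper by hand, which is what PGO would ideally take care of.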

Go and read https://stackoverflow.com/questions/4365980/how-to-use-profile-guided-optimizations-in-g and see if we can apply it.
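For context, a rough sketch of the GCC workflow that answer describes: build with -fprofile-generate, run a representative workload, then rebuild with -fprofile-use. The build steps are shown as comments, and the file names (hot.cpp, driver.cpp, bench) and the function below are made up for illustration.

```cpp
// Assumed build steps (GCC); file names are hypothetical:
//   g++ -O2 -fprofile-generate hot.cpp driver.cpp -o bench   # instrumented build
//   ./bench                                                  # run workload, writes .gcda files
//   g++ -O2 -fprofile-use hot.cpp driver.cpp -o bench        # rebuild using the profile
#include <cstdint>
#include <vector>

// Hypothetical hot loop. In the assumed workload most cells are not the
// null marker, so the `continue` branch is cold. The profile records how
// often each branch is taken, and the compiler can use that for layout and
// inlining decisions.
std::int64_t sumNonNull(const std::vector<std::int64_t>& cells,
                        std::int64_t nullMarker) {
  std::int64_t total = 0;
  for (std::int64_t cell : cells) {
    if (cell == nullMarker) {
      continue;  // rarely taken in the assumed workload
    }
    total += cell;
  }
  return total;
}
```

The profile is only as good as the workload run in the middle step, so the queries used for training matter.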

A profile generated from running tests/test-all results in:

  • 10% on the census cyclist query, 42ms -> 38ms
  • 11% on select count(*) from census where profile_id = '1930', 444ms -> 393ms
  • 4% on select count(*) from census where profile_id = 1930, 1920ms -> 1860ms

We could probably improve this by putting more realistic queries in the test dataset. Maybe Mark Litwintschik's blog post w/benchmarks could be used for data?

Anyway, worth pursuing as part of an automated release system.

This is only PGO on the vtable implementation?

I would be very interested if it makes a difference on parquet-cpp too.

Also, I usually profile a bit with perf and generate flame graphs (http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html). This might also be helpful.

Yes, this is just on the vtable code. I'll have a look at trying it on the rest of the toolchain, too.

Thanks for the reference to flame graphs.

Enabling it on parquet-cpp alone didn't change anything, and enabling it on both parquet-cpp and arrow actually regressed performance significantly. :( I can see the .gcda files being created, and I'm pretty sure the make output shows it's picking up the correct libs. The enthusiast in me wants to dig in further, but given my relative lack of C++ experience, I think I'll have to put that on the shelf for now. :) Timings for posterity are below.

On gcc-5.5, vtable, parquet-cpp and arrow (via the -DCMAKE_BUILD_TYPE=profile_gen and -DCMAKE_BUILD_TYPE=profile_build cmake flags):

  • '1930' -> 600ms
  • 1930 -> 2200ms

On gcc-7.3, vtable, parquet-cpp and arrow:

  • '1930' -> 600ms
  • 1930 -> 2000ms

On gcc-7.3, just vtable:

  • '1930' -> 376ms
  • 1930 -> 1810ms