Tantalus13A98B5F/convbench

About template metaprogramming

Two meta-classes are employed in conv0.cpp:

  • class DimIdx, to extract the dimension info stored in std::vector. This part works well: neither the vector accesses nor the index computation shows up as a hotspot;
  • class Range and class RangeUnroll, to express loop unrolling & loop tiling. They are a disaster.

First of all, there should be no control-flow divergence in a piece of code you'd like the compiler to inline for you. if-else will not do, but consecutive loops are okay -- a good way to trick the compiler into doing what you really want. That leads to the major problem -- how well the compiler does in post-processing.
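To illustrate the "consecutive loops instead of if-else" point, here is a minimal sketch. The function name `sum_unrolled4` and the reduction itself are hypothetical and not taken from conv0.cpp; the only thing it is meant to show is a hand-unrolled main loop followed directly by a remainder loop, with no branch inside either loop body.

```cpp
#include <cstddef>

// Hypothetical helper (an assumption, not code from conv0.cpp):
// sums n floats with a 4-way unrolled main loop and a remainder loop.
// The two loops run back to back; neither body contains an if-else.
inline float sum_unrolled4(const float* x, std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;

    // Main loop: processes full groups of four elements.
    for (; i + 4 <= n; i += 4) {
        acc0 += x[i + 0];
        acc1 += x[i + 1];
        acc2 += x[i + 2];
        acc3 += x[i + 3];
    }

    // Remainder loop: processes the last n % 4 elements.
    for (; i < n; ++i) {
        acc0 += x[i];
    }

    return (acc0 + acc1) + (acc2 + acc3);
}
```

Whether a given compiler keeps this shape, re-rolls it, or vectorizes it is exactly the "post-processing" question, so the generated assembly still has to be checked per compiler.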

I've tried ICC 19.1. Guided loop unrolling produces 50% more memory accesses than the automatic one. Repeated accesses to the same address are preserved. Why is that?

g++ is another pig, taking >4x the time to finish; 9.3.0 takes >5x. clang++ is far better, but Intel still does the best.

I'm ending this topic...

  • used an explicit Storage class to extract data into variables;
  • then there should be a way to do the unrolling statically (indices passed as template arguments);
  • currently unrolled manually, in 1D only; memory reads are reduced by 30%, but it is only a little bit faster; it's not critical;
  • how the peeling-off is written can greatly impact the generated code;
  • further opportunities lie in vectorization, which has to be hand-written;
  • it seems that the compiler prefers nested loops to unrolled instructions; thus naive unrolling should be considered unhelpful.
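The first two bullets (an explicit storage class, and unrolling driven by indices passed as template arguments) could look roughly like the sketch below. Everything here is an assumption for illustration: the `Storage` struct is a stand-in, not the class from this repo, and `load` / `dot` are hypothetical names. It uses `std::index_sequence` (C++14), so every index is a compile-time constant and no loop appears in the source.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for the "Storage" idea: N values copied out of
// memory into a fixed-size local object, so later code reads locals.
template <std::size_t N>
struct Storage {
    std::array<float, N> v;
};

template <std::size_t... I>
inline Storage<sizeof...(I)> load_impl(const float* p,
                                       std::index_sequence<I...>) {
    // Pack expansion: one load per compile-time index I.
    return Storage<sizeof...(I)>{{p[I]...}};
}

// Loads N consecutive floats starting at p.
template <std::size_t N>
inline Storage<N> load(const float* p) {
    return load_impl(p, std::make_index_sequence<N>{});
}

template <std::size_t N, std::size_t... I>
inline float dot_impl(const Storage<N>& a, const Storage<N>& b,
                      std::index_sequence<I...>) {
    float acc = 0.0f;
    // C++14 pack-expansion idiom: the braced list is evaluated left to
    // right, giving one multiply-add per index without a runtime loop.
    int expand[] = {0, ((acc += a.v[I] * b.v[I]), 0)...};
    (void)expand;
    return acc;
}

// Statically unrolled dot product of two Storage<N> objects.
template <std::size_t N>
inline float dot(const Storage<N>& a, const Storage<N>& b) {
    return dot_impl(a, b, std::make_index_sequence<N>{});
}
```

A call such as `dot(load<4>(w), load<4>(x))` then expands to four loads per operand and four multiply-adds. Given the last bullet, this kind of expansion is not guaranteed to beat a plain nested loop, so it is worth measuring before adopting it.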