tlc-pack/tvm-tensorir

[Bug] Performance bug: SampleInitPopulation

Closed this issue · 6 comments

I am encountering a performance issue in EvolutionaryNode::SampleInitPopulation. Some perf data:

  • SampleInitPopulation: ~26s
  • EvolveWithCostModel: ~14s
  • Build & Measure: ~66s

It is not reasonable for SampleInitPopulation to be so much slower than EvolveWithCostModel. In theory, it should be roughly 5-10x faster than EvolveWithCostModel.

Glancing through htop, I noticed that only 8 threads are active while SampleInitPopulation is executing, whereas there should be 32 on my AMD 3950X (16C/32T).

There is only one lock in the code, which is very unlikely to affect performance because it is acquired only 2048 times during these 26s.
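To sanity-check the claim about the lock, here is a rough sketch that times 2048 acquisitions of a mutex. It is not the code from SampleInitPopulation: `TimeLockAcquisitions` is a hypothetical helper, and the mutex here is uncontended, so this only gives a lower bound on the cost. A contended lock would cost more per acquisition.

```cpp
#include <chrono>
#include <mutex>

// Hypothetical helper, not part of TVM: locks and unlocks an uncontended
// mutex `n` times and returns the elapsed time in microseconds.
// `counter` is incremented under the lock so the loop has an observable effect.
long long TimeLockAcquisitions(int n, long long* counter) {
  std::mutex mu;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < n; ++i) {
    std::lock_guard<std::mutex> guard(mu);
    ++(*counter);
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(end - start)
      .count();
}
```

Calling `TimeLockAcquisitions(2048, &counter)` and comparing the result against the 26s total shows how small the locking share is on a given machine.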

I am opening this thread so that I don't forget. Will dig a bit deeper later.

Going to work on it.

The post processor VerifyGPU is the cause of the issue

The pass VerifyGPUCode is the root cause

The root cause is that the pass's exception try-catch is really slow. Using the helper provided in tir analysis, we can reduce the time from ~26s to ~8s.
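A minimal sketch of the two styles being compared, assuming a simplified setting. None of the names below are TVM APIs: `Candidate`, `kMaxThreadsPerBlock`, and both checks are hypothetical stand-ins for illustration. The first style reports an invalid candidate by throwing and catching, so every rejected candidate pays for a throw and stack unwind. The second returns a boolean directly.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a sampled schedule; only one property is modeled.
struct Candidate {
  int threads_per_block;
};

// Hypothetical limit, chosen only for illustration.
constexpr int kMaxThreadsPerBlock = 1024;

// Style 1: the check reports failure by throwing.
void VerifyOrThrow(const Candidate& c) {
  if (c.threads_per_block > kMaxThreadsPerBlock) {
    throw std::runtime_error("too many threads per block");
  }
}

// Converts the exception into a boolean; each invalid candidate pays
// the cost of a throw plus stack unwinding.
bool IsValidViaException(const Candidate& c) {
  try {
    VerifyOrThrow(c);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}

// Style 2: the check returns a boolean directly, with no exception machinery.
bool IsValidViaBool(const Candidate& c) {
  return c.threads_per_block <= kMaxThreadsPerBlock;
}

// Counts valid candidates with both styles and checks that they agree.
std::size_t CountValid(const std::vector<Candidate>& candidates) {
  std::size_t via_exception = 0;
  std::size_t via_bool = 0;
  for (const Candidate& c : candidates) {
    if (IsValidViaException(c)) ++via_exception;
    if (IsValidViaBool(c)) ++via_bool;
  }
  assert(via_exception == via_bool);
  return via_bool;
}
```

Both styles give the same answer. The difference is only in cost, which matters when many sampled candidates are rejected.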

Amazing that try-catch can have such a huge impact!