Discussion about improving compile time
lh-ycx opened this issue · 0 comments
Hey guys, I noticed that compiling a single op from scratch takes 1~2 minutes. This could be a problem when using AIT to compile a huge graph (like an LLM).
For example, I previously compiled a GPT-2 backward graph (146 of its ops are compiled with AIT), and AIT compilation took ~40 mins. I did some profiling and found that compiling the profiling obj accounts for >75% of the total compile time. The reason is that its 32 kernels are compiled serially, since all of their source code is put in one .cu source file.
In theory, splitting these 32 kernels into separate files would let them be compiled in parallel, which should significantly reduce the compile time.
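To make the idea concrete, here is a minimal sketch of what I have in mind. It is not AIT's actual build code: the helper names (`write_kernel_files`, `nvcc_compile`, `compile_in_parallel`) are hypothetical, and it assumes each kernel's source is available as a standalone string and that `nvcc` is on `PATH`. Each kernel goes into its own .cu file, and the files are then compiled concurrently.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


def write_kernel_files(kernels: Dict[str, str], out_dir: Path) -> List[Path]:
    """Write each kernel's source to its own .cu file (one file per kernel)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, source in kernels.items():
        path = out_dir / f"{name}.cu"
        path.write_text(source)
        paths.append(path)
    return paths


def nvcc_compile(src: Path) -> Path:
    """Compile one .cu file to an object file. Assumes nvcc is on PATH."""
    obj = src.with_suffix(".o")
    subprocess.run(["nvcc", "-c", str(src), "-o", str(obj)], check=True)
    return obj


def compile_in_parallel(
    sources: List[Path],
    compile_fn: Callable[[Path], Path] = nvcc_compile,
    max_workers: int = 8,  # hypothetical default; tune to the number of cores
) -> List[Path]:
    """Run compile_fn on every source concurrently, preserving input order.

    Threads are enough here because the heavy work happens in the
    compiler subprocesses, not in the Python interpreter.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compile_fn, sources))
```

The resulting object files would still need to be linked into the profiler binary afterwards. As a rough upper bound, if exactly 75% of the time were spent in this step and the 32 kernels compiled perfectly in parallel, the overall speedup would be about 1 / (0.25 + 0.75 / 32) ≈ 3.7x. In practice it would likely be lower, since every file has to re-parse the shared headers and the machine may have fewer than 32 free cores.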
I'm opening this issue to ask for your opinions. Would this approach work? If not, are there any other possible approaches? Thanks.