How to specify several architectures to be generated
littlewu2508 opened this issue · 5 comments
Hi! I'm developing the rocBLAS package for Gentoo, and everything goes well except for the long compilation time of the Tensile code objects (both source and asm kernels) using the asm_full library logic. I noticed that by default all architectures are enabled, generating all .co and .hsaco files; by specifying --architecture='gfx803', only TensileLibrary_gfx803.co is compiled, but clang still compiles the source kernels for all architectures, which takes a very long time. Also, with this argument I can't generate asm kernels for more than one architecture.
Since users know what GPU they're using, is there a way to specify the EXACT code object architectures to be generated? For example, if I want gfx906 and gfx908:xnack-, TensileCreateLibrary would collect only the source solutions and the gfx906 and gfx908 solutions, and compile them into TensileLibrary_gfx906.co, TensileLibrary_gfx908.co, Kernels.so-000-gfx906.hsaco, and Kernels.so-000-gfx908-xnack-.hsaco.
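For reference, the invocation I'm describing looks roughly like this (the paths are placeholders for my actual logic and build directories):

$ Tensile/bin/TensileCreateLibrary --architecture=gfx803 \
      /path/to/LibraryLogic /path/to/output HIP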
I did a bit of work on this, because I also find myself waiting for builds to complete.
I think the architectures for the source kernels are chosen within buildSourceCodeObjectFile() in Tensile/TensileCreateLibrary.py. If we can pipe the requested architectures down to there, we can fix the source-kernel aspect of this problem.
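As a rough sketch (not the actual code path), buildSourceCodeObjectFile() ends up invoking the compiler with one target flag per architecture; restricting that set to the requested architectures is the essence of the fix:

# hypothetical command line; the real invocation is assembled inside
# buildSourceCodeObjectFile() and carries many more flags
$ hipcc -c Kernels.cpp --offload-arch=gfx906 --offload-arch=gfx908:xnack- ...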
@littlewu2508 your request is very reasonable, especially given that a full rocBLAS build can take a long time. As of ROCm 4.3, the -a option of install.sh can take a subset of gfx values as you requested, e.g. -a "gfx906:xnack-;gfx908:xnack-", and the rocBLAS build will honor the list of gfx values you specified - almost.
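For example, from the top of the rocBLAS source tree:

$ ./install.sh -a "gfx906:xnack-;gfx908:xnack-"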
On a gfx906 test box with nproc=64 and 256 GB of main memory, using ROCm 4.3, here are some statistics: a full rocBLAS build took 86 minutes; a subset -a "gfx906:xnack-;gfx908:xnack-" build took 41 minutes.
However, the subset of gfx values is currently not honored for the compilation of the humongous Kernels.cpp; this will be fixed in a future ROCm release. As it stands right now, with ROCm 4.3, a -a "gfx906:xnack-;gfx908:xnack-" build produces the following code object bundles:
$ ls *co
Kernels.so-000-gfx1010.hsaco Kernels.so-000-gfx906-xnack-.hsaco
Kernels.so-000-gfx1011.hsaco Kernels.so-000-gfx908-xnack-.hsaco
Kernels.so-000-gfx1012.hsaco Kernels.so-000-gfx90a-xnack+.hsaco
Kernels.so-000-gfx1030.hsaco Kernels.so-000-gfx90a-xnack-.hsaco
Kernels.so-000-gfx803.hsaco TensileLibrary_gfx906.co
Kernels.so-000-gfx900.hsaco TensileLibrary_gfx908.co
while we would like to see (as you indicated):
$ ls *co
Kernels.so-000-gfx906-xnack-.hsaco TensileLibrary_gfx906.co
Kernels.so-000-gfx908-xnack-.hsaco TensileLibrary_gfx908.co
With a development version of Tensile that honors the subset of gfx values when compiling Kernels.cpp, the -a "gfx906:xnack-;gfx908:xnack-" rocBLAS build time was further reduced from the current 41 minutes to 36 minutes.
Hope this response helps you. Thank you for your comments.
Thanks! I just looked at the rocm-4.3 release and found that choosing multiple architectures is implemented. I hope that specifying the offload-arch when compiling Kernels.cpp will be implemented soon; I would like to have a try and submit a PR if possible.
Although I had figured out a hack, #1398 works, so I'll use that one. Thanks!
I meant to provide some feedback on this, but it slipped my mind until I saw #1418 (which I also expect to enable major improvements in rocBLAS build times).
I did not time the difference this change made, but I could tell when it was first included in rocBLAS develop, because the speedup was immediately noticeable. @zaliu measured that the rocBLAS build time on his test system dropped from 41 minutes to 36 minutes, but the change can be much more dramatic than that, depending on which architectures you specify and what level of parallelism you set in your hipcc flags.
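(For anyone tuning this: one way to raise that parallelism, assuming your environment honors the standard hipcc variables, is hipcc's -parallel-jobs flag, e.g.:)

$ export HIPCC_COMPILE_FLAGS_APPEND="-parallel-jobs=8"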
Thanks a lot! This was a significant step towards more reasonable build times for rocBLAS.