[Bug]: rocblas link fails with relocation R_X86_64_PC32 out of range
aagit opened this issue · 28 comments
Describe the bug
Build fails during final shared lib linking.
To Reproduce
Steps to reproduce the behavior:
- build rocblas version 6.0.2 with export ROCM_GPUS="gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102"
- cmake -G Ninja
-DBUILD_FILE_REORG_BACKWARD_COMPATIBILITY=OFF
-DROCM_SYMLINK_LIBS=OFF
-DHIP_PLATFORM=amd
-DAMDGPU_TARGETS=${ROCM_GPUS}
-DCMAKE_INSTALL_LIBDIR=$ROCM_LIB
-DCMAKE_INSTALL_BINDIR=$ROCM_BIN
-DBUILD_WITH_TENSILE=ON
-DBUILD_WITH_PIP=OFF
- See error
Expected behavior
Build should not fail.
Log-files
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7f991): relocation R_X86_64_PC32 out of range: -2179713377 is not in [-2147483648, 2147483647]; references section '.gcc_except_table.rocblas_gemm_ex3'
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7fa49): relocation R_X86_64_PC32 out of range: -2179713445 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._Z27exception_to_rocblas_statusNSt15__exception_ptr13exception_ptrE'
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7fa75): relocation R_X86_64_PC32 out of range: -2179713457 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._Z11log_profileIJRA7_KcRPS0_S2_S4_S2_S4_S2_S4_RA13_S0_S4_S2_RcS2_S7_RA2_S0_RiS9_SA_S9_SA_RA6_S0_dRA4_S0_SA_SE_SA_RA5_S0_dSE_SA_SE_SA_SG_R18rocblas_gemm_algo_RA15_S0_SA_SC_19rocblas_gemm_flags_EEvP15_rocblas_handleS3_DpOT_'
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7faa1): relocation R_X86_64_PC32 out of range: -2179713477 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._ZNSt8__detaillsIcSt11char_traitsIcEEERSt13basic_ostreamIT_T0_ES7_RKNS_14_Quoted_stringIPKS4_S4_EE'
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7fad5): relocation R_X86_64_PC32 out of range: -2179713485 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._Z22log_bench_scalar_valueIfLi0EENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPKcPKT_'
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7fb05): relocation R_X86_64_PC32 out of range: -2179713501 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._ZN16argument_profileISt5tupleIJPKcS2_S2_21rocblas_atomics_mode_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_cS2_cS2_iS2_iS2_iS2_dS2_iS2_iS2_dS2_iS2_iS2_18rocblas_gemm_algo_S2_iS2_19rocblas_gemm_flags_EEED2Ev'
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7fb3d): relocation R_X86_64_PC32 out of range: -2179713529 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._ZN16argument_profileISt5tupleIJPKcS2_S2_21rocblas_atomics_mode_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_cS2_cS2_iS2_iS2_iS2_dS2_iS2_iS2_dS2_iS2_iS2_18rocblas_gemm_algo_S2_iS2_19rocblas_gemm_flags_EEEclEOS6_'
ld.lld: error: lto.tmp:(.rodata._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_+0x0): relocation R_X86_64_PC32 out of range: 2180923961 is not in [-2147483648, 2147483647]; references section '.text._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_'
>>> referenced by rocblas_gemm_ex3.cpp
ld.lld: error: lto.tmp:(.gcc_except_table._Z27exception_to_rocblas_statusNSt15__exception_ptr13exception_ptrE+0x14): relocation R_X86_64_PC32 out of range: 2192096096 is not in [-2147483648, 2147483647]; references section '.data'
>>> referenced by rocblas_gemm_ex3.cpp
ld.lld: error: lto.tmp:(.rodata._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_+0x4): relocation R_X86_64_PC32 out of range: 2180924065 is not in [-2147483648, 2147483647]; references section '.text._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_'
>>> referenced by rocblas_gemm_ex3.cpp
ld.lld: error: lto.tmp:(.gcc_except_table._Z27exception_to_rocblas_statusNSt15__exception_ptr13exception_ptrE+0x18): relocation R_X86_64_PC32 out of range: 2192096100 is not in [-2147483648, 2147483647]; references section '.data'
>>> referenced by rocblas_gemm_ex3.cpp
ld.lld: error: lto.tmp:(function rocblas_gemm_ex3: .text.rocblas_gemm_ex3+0x7fa): relocation R_X86_64_PC32 out of range: -2180446035 is not in [-2147483648, 2147483647]; references '.L.str.36'
>>> referenced by rocblas_gemm_ex3.cpp:176 (/root/rpmbuild/BUILD/rocBLAS-rocm-6.0.2/library/src/blas_ex/rocblas_gemm_ex3.cpp:176)
>>> defined in lto.tmp
ld.lld: error: lto.tmp:(function rocblas_gemm_ex3: .text.rocblas_gemm_ex3+0x801): relocation R_X86_64_PC32 out of range: -2180875840 is not in [-2147483648, 2147483647]; references '.L.str.35'
>>> referenced by rocblas_gemm_ex3.cpp:176 (/root/rpmbuild/BUILD/rocBLAS-rocm-6.0.2/library/src/blas_ex/rocblas_gemm_ex3.cpp:176)
>>> defined in lto.tmp
ld.lld: error: lto.tmp:(.rodata._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_+0x8): relocation R_X86_64_PC32 out of range: 2180924159 is not in [-2147483648, 2147483647]; references section '.text._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_'
>>> referenced by rocblas_gemm_ex3.cpp
ld.lld: error: lto.tmp:(function rocblas_gemm_ex3: .text.rocblas_gemm_ex3+0x808): relocation R_X86_64_PC32 out of range: -2180734041 is not in [-2147483648, 2147483647]; references '.L.str'
>>> referenced by rocblas_gemm_ex3.cpp:176 (/root/rpmbuild/BUILD/rocBLAS-rocm-6.0.2/library/src/blas_ex/rocblas_gemm_ex3.cpp:176)
>>> defined in lto.tmp
ld.lld: error: lto.tmp:(function rocblas_gemm_ex3: .text.rocblas_gemm_ex3+0x80f): relocation R_X86_64_PC32 out of range: -2180325041 is not in [-2147483648, 2147483647]; references '.L.str.19'
>>> referenced by rocblas_gemm_ex3.cpp:176 (/root/rpmbuild/BUILD/rocBLAS-rocm-6.0.2/library/src/blas_ex/rocblas_gemm_ex3.cpp:176)
>>> defined in lto.tmp
ld.lld: error: lto.tmp:(.rodata._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_+0xc): relocation R_X86_64_PC32 out of range: 2180924253 is not in [-2147483648, 2147483647]; references section '.text._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_'
>>> referenced by rocblas_gemm_ex3.cpp
ld.lld: error: lto.tmp:(function rocblas_gemm_ex3: .text.rocblas_gemm_ex3+0x816): relocation R_X86_64_PC32 out of range: -2180325041 is not in [-2147483648, 2147483647]; references '.L.str.20'
>>> referenced by rocblas_gemm_ex3.cpp:176 (/root/rpmbuild/BUILD/rocBLAS-rocm-6.0.2/library/src/blas_ex/rocblas_gemm_ex3.cpp:176)
>>> defined in lto.tmp
ld.lld: error: lto.tmp:(function rocblas_gemm_ex3: .text.rocblas_gemm_ex3+0x836): relocation R_X86_64_PC32 out of range: -2180302935 is not in [-2147483648, 2147483647]; references '.L.str.37'
>>> referenced by rocblas_gemm_ex3.cpp:176 (/root/rpmbuild/BUILD/rocBLAS-rocm-6.0.2/library/src/blas_ex/rocblas_gemm_ex3.cpp:176)
>>> defined in lto.tmp
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7fb71): relocation R_X86_64_PC32 out of range: -2179713541 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._ZNK24rocblas_internal_ostream3dupEv'
ld.lld: error: too many errors emitted, stopping now (use --error-limit=0 to see all errors)
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.
Environment
Should not matter, it is not a runtime issue.
Software | version |
---|---|
rocm-core | rocm-core-6.0.2-1.fc40.x86_64 |
rocblas | rocblas-6.0.2-3.fc40.x86_64 |
Additional context
Although I don't see this reported among the GitHub issues, this should be a very well-known issue. So I wonder whether it is planned never to be fixed?
If the above assumption is correct, I would like to know if upstream is willing to take in a fix for it, assuming a fix is possible.
You are running into the limitation that ld can only link objects whose sections are at most a signed 32-bit offset away from each other.
As you enable more targets, rocblas gets larger, eventually exceeding this limit. Yes, this is a huge problem with how rocm is architected and it desperately needs some kind of resolution, but for now the only solution is to build for fewer targets.
If you want to remove an architecture, I would recommend gfx803, as this architecture is currently broken anyhow unless you disable the asm kernels provided by Tensile.
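The limit described above can be checked against the numbers in the log. The following is a minimal sketch, not part of any ROCm tooling: the helper names `fits_pc32` and `overshoot_bytes` are made up for illustration, and the two displacements are copied from the ld.lld output in this issue.

```python
# Bounds of a signed 32-bit displacement, as printed in the ld.lld messages.
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def fits_pc32(displacement: int) -> bool:
    """Return True if a PC-relative displacement fits in a signed 32-bit field."""
    return INT32_MIN <= displacement <= INT32_MAX


def overshoot_bytes(displacement: int) -> int:
    """Return how many bytes a displacement lies outside the signed 32-bit range."""
    if displacement > INT32_MAX:
        return displacement - INT32_MAX
    if displacement < INT32_MIN:
        return INT32_MIN - displacement
    return 0


# Two displacements copied from the log in this issue.
for value in (-2179713377, 2192096096):
    print(value, fits_pc32(value), overshoot_bytes(value))
```

Both values miss the range by only a few tens of megabytes, which is consistent with the observation that dropping a single target can be enough to make the link succeed.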
Thanks for the quick feedback.
Yes, if I built for fewer targets it would succeed, but I already removed gfx1103, as I've been building for an older codebase where gfx1103 could not be enabled. So removing gfx803 would hide the problem and kick the can down the road, but it doesn't appear to be a satisfactory long-term solution.
If we don't work on a solution for this now, the end result is that every rocm-accelerated app binary has to be built multiple times against independent and incompatible rocm builds, just as if they were separate GPU compute stacks with nothing in common. This also multiplies the build time and the disk space requirements of every app, maybe not xN, but close.
It would also provide a subpar experience to the end user, who then has to figure out the right binary to install and invoke, instead of rocm resolving that GPU detail in a way that is transparent to the end user.
Yup, this is the major reason why rocm supports so few GPUs, and if they don't address this soon it has the potential to sink rocm, since it forces them to drop support for old GPUs extremely fast (at an ever-accelerating pace as rocm gets larger), which ultimately destroys customer confidence.
@cgmb I think you had some other suggestions by using generic targets, but I can't remember how much progress has happened there.
Thank you for bringing this issue to our attention. We appreciate your feedback and suggestions.
We recommend building with the suggested targets in relation to the ROCm stack. The default target list for 6.0 includes:
- gfx900
- gfx906:xnack-
- gfx908:xnack-
- gfx90a:xnack+
- gfx90a:xnack-
- gfx940
- gfx941
- gfx942
- gfx1010
- gfx1012
- gfx1030
- gfx1100
- gfx1101
- gfx1102
The team is aware of the issue and is exploring possible solutions.
Thank you for your understanding and cooperation.
> @cgmb I think you had some other suggestions by using generic targets, but I can't remember how much progress has happened there.
Sure, https://llvm.org/docs/AMDGPUUsage.html#amdgpu-generic-processor-table could be used, at the cost of some performance for the non-gfx10-3-generic targets. Ultimately this just kicks the can further down the road, but for now, yes, this would be sufficient.
Right now there is also no support for ELFABIVERSION_AMDGPU_HSA_V6, so those targets don't work yet, but soon, I presume.
Would it be possible to split the librocblas.so.4.0 in librocblas-gfx900.so.4.0 librocblas-gfx90a.so.4.0 librocblas-gfxXYZ.so.4.0... so each individual gfx target lands in a different shared library, and then have the main librocblas.so.4.0 dynamically load only the gfx targets available in hardware either during initialization of the main library or even better lazily on demand?
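A rough sketch of the idea proposed above, purely for discussion: nothing like this exists in rocBLAS. The library naming scheme `librocblas-<gfx>.so.4.0` follows the question, while `arch_library_name`, `LazyArchLoader`, and the choice to fold `:xnack+`/`:xnack-` variants into one file are assumptions made only for this illustration.

```python
import ctypes


def arch_library_name(gfx_target: str, soversion: str = "4.0") -> str:
    """Map a target such as 'gfx90a:xnack-' to a hypothetical per-arch library name."""
    base = gfx_target.split(":", 1)[0]  # assumption: xnack variants share one file
    return f"librocblas-{base}.so.{soversion}"


class LazyArchLoader:
    """Load each per-architecture library at most once, on first use."""

    def __init__(self, load=ctypes.CDLL):
        self._load = load      # injectable so the sketch can be tried without real files
        self._handles = {}     # library name -> loaded handle

    def get(self, gfx_target: str):
        name = arch_library_name(gfx_target)
        if name not in self._handles:
            self._handles[name] = self._load(name)
        return self._handles[name]
```

With this shape, a system containing only gfx90a GPUs would never map the device code of any other architecture, and the host-only main library would stay far below the 2 GiB limit.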
@aagit that separate gfx .so design has been evaluated as one possible solution, but we are also looking at other strategies. For now, until the full gfx list that lands in a specific release requires a new build and packaging pattern, we suggest you build and package the version-specific set of gfx targets listed in the top-level CMakeLists.txt. This corresponds to our build scripts' default option.
I appreciate your suggestion above. I agree that's the least bad solution for the time being, and I had already tried it. If there are other ways to fix it, would you share them so they can be discussed here?
Overall I would recommend picking the simplest way to fix it and shipping it ASAP, because while working on a rocm-accelerated app I noticed that rocm has already been packaged in the open by building it N times and installing it in incompatible paths. The technical justification is to work around this issue (so it is as if there were /opt/rocm1 /opt/rocm2 /opt/rocm3 /opt/rocmN installed, each one supporting a small subset of gfx targets so that the link does not fail and gfx8 and gfx1103 can be enabled too). If the duplication were only on the rocm side it would (perhaps) be a lesser concern, but this causes all apps to be rebuilt N times, multiplying the build time by N. Last but not least, the end user then has to pick the right binary (among the N available) for their GPU or it won't work, possibly just because of minor path differences.
For example: I built an app linked against rocm that way, and the total size of the N builds against N rocms was 96GB. Then I ran hardlink . and it dropped the size to 92GB. Then I ran hardlink . -t -p and it dropped the size to 32GB.
What I described in #1448 (comment) is already happening. My view is that this way of packaging rocm is not sustainable, even if the extra energy requirements of the build system could be met, because it provides a subpar experience to the end user compared to the competing GPU compute stacks, where building an app once is enough. I had already tried your suggestion above, of course, but it is now a matter of opinion whether the workaround is worse than the disease. So I don't see a clear path to unwinding the rocm build loop until this issue is fixed... Thanks!
Another temporary option, if you don't want to drop any GPUs from your builds, might be to build "gfx90a" or just "gfx90a:xnack-". The xnack+ configuration is very rare, and omitting it doesn't leave any user totally out in the cold (just with possibly reduced performance, depending on workload), and "gfx90a" should emit code that works in both xnack+ and xnack- modes.
All gfx9 GPUs support xnack+; the fact that only gfx90a is built both ways is a clear hint as to how common this is.
We have changed to only build our source kernels with xnack "any" for gfx90a after commit 6a267fd. We expect to adjust our gfx list before release and, as always, we ensure there are no linking issues on all supported OSes and with any final target list. Other subdivisions of the library along functionality are also possible, but none are trivial changes. Clang compiler and linker mcmodel flag changes are also possible with the current library design, along with the target variations mentioned in earlier comments.
This bug should likely be considered fixed and the issue closed, since you didn't get the error when you built rocblas with our supported gfx list. A new issue could be created, as your use case with N different ROCm builds is unclear to me, as is why the app is rebuilt and linked against all of them rather than built against the latest. If your application is open source, please refer to it in your new issue and detail why it is built separately for each gfx. Or, if this is really just a request to support more gfx targets, then word it as such along with your use case and gfx list. If you rebuilt rocm or rocblas with one gfx in each version, please also clarify that in your new feature request issue. It could be that your new issue belongs in ROCm if it is not particular to rocBLAS.
I think wanting to build a version of rocblas for all targets that work, as opposed to just the default targets, is a reasonable desire, especially from a distro maintainer's perspective, where AMD's support status of a specific architecture is not important; it only needs to work.
Further, as you expand rocblas and its supported target architectures, you will eventually hit a wall here that will force you to change how this works. It is, for instance, impossible for you to support the full range of your released devices using the current implementation, unlike how cublas supports a huge range of devices, down to ones much older than gfx900.
Thus I request that this be left open, at the very least as a feature request.
https://src.fedoraproject.org/rpms/python-torch/blob/rawhide/f/python-torch.spec#_998
https://src.fedoraproject.org/rpms/python-torch/blob/rawhide/f/python-torch.spec#_37
https://src.fedoraproject.org/rpms/rocm-rpm-macros/blob/rawhide/f/default
https://src.fedoraproject.org/rpms/rocm-rpm-macros/blob/rawhide/f/gfx9
https://src.fedoraproject.org/rpms/rocm-rpm-macros/blob/rawhide/f/gfx90a
https://src.fedoraproject.org/rpms/rocblas/blob/rawhide/f/rocblas.spec#_117
https://src.fedoraproject.org/rpms/rocblas/blob/rawhide/f/rocblas.spec#_142
https://src.fedoraproject.org/rpms/rocsolver/blob/rawhide/f/rocsolver.spec#_136
rocsolver has also the loop, but because it depends on rocblas it's not possible to tell if rocsolver is like yet another app depending on rocblas (including pytorch) having to be rebuilt N times as a dependency on rocblas being rebuilt N times.
For example, see there's no rocm build loop in packages that don't depend on rocblas like the opencl runtime:
https://src.fedoraproject.org/rpms/rocm-runtime/blob/rawhide/f/rocm-runtime.spec#_50
For the time being rocblas appears the origin of the rocm build loop caused by the link failure above, but if other rocm parts also share this same issue, once rocblas is fixed, the fix can trickle there too I assume. The core does not seem to have this issue and in fact it is being built only once as one would expect.
I can open a new issue but we'd lose part of the context so for now I post it here. We can always open a new issue later. The goal is to fix the build so it doesn't fail linkage, once fixed it wouldn't add any noticeable feature to the software other than succeeding a build that previously failed as far as I can tell.
Thank you!
About the possible solutions mentioned, I agree that your -mcmodel= (I suppose "medium") suggestion for the short term appears the most attractive solution. In fact, wishful thinking, I wonder if there's a chance it could already be switched on through some environment variable.
The compiler team is looking into adjusting the layout of the shared libraries. I'm not sure of the details of their proposal, but the gist of it is to move all the offload bundles to either the beginning or the end of the library, which will ensure that all host code is close together (and therefore not need more than 32-bit offsets). I'm not sure the status of that proposal, but it seemed promising.
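To illustrate why the ordering matters, here is a small sketch with made-up section sizes (only the `.hip_fatbin` name comes from this thread; the sizes and the helper `host_span` are hypothetical). It lays sections out back to back and measures the distance that host-to-host relocations would have to cover.

```python
INT32_MAX = 2**31 - 1
MiB = 2**20
GiB = 2**30


def host_span(sections, host_names):
    """Lay out (name, size) sections consecutively; return the byte span covering all host sections."""
    offsets = {}
    cursor = 0
    for name, size in sections:
        offsets[name] = (cursor, cursor + size)
        cursor += size
    start = min(offsets[n][0] for n in host_names)
    end = max(offsets[n][1] for n in host_names)
    return end - start


host = {".text", ".data"}
# Device code sitting between host code and host data (sizes are illustrative).
interleaved = [(".text", 64 * MiB), (".hip_fatbin", 3 * GiB), (".data", 16 * MiB)]
# Same sections with the offload bundles moved to the end.
grouped = [(".text", 64 * MiB), (".data", 16 * MiB), (".hip_fatbin", 3 * GiB)]

print(host_span(interleaved, host) > INT32_MAX)  # host sections too far apart
print(host_span(grouped, host) > INT32_MAX)      # host sections stay close together
```

In the second layout the total library is just as large, but no host relocation has to reach across the device code.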
> About the possible solutions mentioned, I agree that your -mcmodel= (I suppose "medium") suggestion for the short term appears the most attractive solution.
The last time I checked, the mcmodel flag did not actually work when compiling HIP code. Although, perhaps that has changed.
> it's not possible to tell if rocsolver is like yet another app depending on rocblas (including pytorch) having to be rebuilt N times as a dependency on rocblas being rebuilt N times.
The rocBLAS ABI does not change depending on the gfx architecture it was built for. Fedora has built rocBLAS for multiple architectures as separate packages, but you can build your application against any of those rocblas packages and it will work with all of them.
I don't think they're rebuilding the rocSOLVER or pytorch libraries because of rocBLAS. It is likely because those libraries/packages also benefit from the same architecture-splitting that they did with rocBLAS. @Mystro256 or @trixirt might be able to shed more light on Fedora's choices here.
We faced this problem last year, when I was first building rocBLAS for all the targets. There was discussion within Fedora about how to work around this, and what we have in place is the solution. The builds are split along major family lines. This was done to keep the explanation of where your GPU falls in the split simple. So we have, at the moment, gfx8, gfx9, gfx10 and gfx11 in F40. The prefix for these is /usr/lib64/rocm/gfxXX. There is also a special set, 'default', which is the union of gfx10 and gfx11; it installs to the normal prefix /usr. The main tradeoff that was made was to include as many targets as possible. This splitting does make packaging more challenging, but that is why we have spec files and do the work to make it happen in Fedora, with pytorch being an example. Time to build is not a major concern.
If/when something changes in the upstream, I and other folks in Fedora ROCm packaging sig will readdress how rocBLAS and similar are built to maximize inclusiveness of gpu targets.
IMO this is a case of perfection being the enemy of good.
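The family split described above can be sketched as a small lookup. This is an illustration only, not Fedora's actual macros: the rule of deriving the family by dropping the last two characters of the target name is an assumption that happens to fit the targets named in this thread, and `gfx_family` and `install_prefixes` are hypothetical helpers.

```python
# Families that also make up the 'default' set installed under /usr (per the comment above).
DEFAULT_FAMILIES = {"gfx10", "gfx11"}


def gfx_family(target: str) -> str:
    """Map e.g. 'gfx90a:xnack+' -> 'gfx9' and 'gfx1030' -> 'gfx10' (assumed naming rule)."""
    base = target.split(":", 1)[0]  # drop any ':xnack+' / ':xnack-' suffix
    if not base.startswith("gfx") or len(base) < 6:
        raise ValueError(f"unrecognized target: {target!r}")
    return base[:-2]  # assumption: the last two characters are minor/stepping


def install_prefixes(target: str) -> list:
    """Return the prefixes under which a build covering this target would be found."""
    family = gfx_family(target)
    prefixes = [f"/usr/lib64/rocm/{family}"]
    if family in DEFAULT_FAMILIES:
        prefixes.append("/usr")
    return prefixes


for t in ("gfx803", "gfx90a:xnack-", "gfx1030", "gfx1100"):
    print(t, install_prefixes(t))
```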
One thing to note about this split-package option is, of course, that any system with multiple GPUs of different architectures will not be supported by your scheme. For this reason I would strongly recommend building as few versions of rocblas as possible, at most 2, to raise the chance of any given heterogeneous system being supported.
@IMbackK agreed: running different gfxN in the same app has become impossible as a result of the 5 rocm builds with the fine-grained split. That is on top of having to pick the right rpm and/or binary of the AI app, or it won't work even with one GPU.
I asked for at most 2 builds some time ago, so thanks for suggesting it too.
The "Good" to me is that there's just 1 binary of every AI app (be it llama-cpp, vllm, mojo, triton, pytorch, etc..) and there's no multiplication x5 of both rpm and binaries of every AI/GPU app under the sun, so when the AI dev that normally runs on the popular, but proprietary, GPU compute stack tries his favorite AI app on Open Source ROCm, it just works without extra complications compared to the previous experience.
Rejecting all the suggestions above, some of which would achieve full inclusiveness with a single build of rocblas (or at most 2 builds), in order to gain minor extra optimizations is, to me, what defines the "Perfection".
Hello,
> The rocBLAS ABI does not change depending on the gfx architecture it was built for. Fedora has built rocBLAS for multiple architectures as separate packages, but you can build your application against any of those rocblas packages and it will work with all of them.
It's hard to see how my app can work against all rocblas packages because it won't know where to find the file it needs at runtime, unless such knowledge is injected with "module load" just before the build, by rebuilding it N times with N different "module load" commands.
In other words, the reason for the app being rebuilt N times could, in the best case, be just path differences, as a workaround for the fact that rocblas wasn't meant to be built N times and installed in N different places.
Rebuilding rocblas N times is a workaround for this rocblas build-time link failure, and, if you wish, rebuilding the app N times is a further, orthogonal workaround for the path differences caused by the first workaround. The build system takes one more hit at every step of the way, workaround on top of workaround.
I would have been content to get away with just 1 app binary for testing, using ROCBLAS_TENSILE_LIBPATH (or some other hidden env variable I found randomly to try to force it to find the right files for the needed rocblas). That didn't appear to work and it segfaulted. Possibly I made some mistake, and maybe it could have worked if I had persisted in that direction, but I don't think a user should be required to set ROCBLAS_TENSILE_LIBPATH by hand to some directory specific to the GPU in use, just so the app doesn't emit some error about missing files in /usr/ with no hint on how to resolve it.
Among other suggestions to avoid rebuilding all apps N times, before filing this issue I tried to ask whether the N rocblas builds could be installed in the same path, hoping that one app binary would then just work. The problem is that there are some file collisions: each build generates files in the same location that can't be automatically disambiguated. I had a quick look at the collisions among the non-ELF-x86 parts and they didn't seem hard to disambiguate, but now I wonder whether even /usr/lib64/librocblas.so (the otherwise >2G shared lib) would be one of them. I could imagine this path being workable if you then accept not being able to run on all gfx targets using the same container image (not just at runtime from the same binary among the N built, which as discussed earlier is already not possible).
Thanks!
> I would have been content to get away with just 1 app binary for testing, using ROCBLAS_TENSILE_LIBPATH (or some other hidden env variable I found randomly to try to force it to find the right files for the needed rocblas). That didn't appear to work and it segfaulted. Possibly I made some mistake, and maybe it could have worked if I had persisted in that direction, but I don't think a user should be required to set ROCBLAS_TENSILE_LIBPATH by hand to some directory specific to the GPU in use, just so the app doesn't emit some error about missing files in /usr/ with no hint on how to resolve it.
This can't work: rocblas contains GPU code inside librocblas.so, not just in its modules, and that GPU code must be available for all GPUs in the system supported by the runtime.
You can, however, just build several librocblas.so with different prefixes and then use LD_LIBRARY_PATH or LD_PRELOAD to select which one; no application rebuild is required.
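A minimal sketch of the LD_LIBRARY_PATH approach from a launcher script. The helper `env_with_rocblas` is hypothetical, and the assumption that the libraries live in a `lib` subdirectory of the chosen prefix may not match a given installation.

```python
import os


def env_with_rocblas(prefix, base_env=None, libdir="lib"):
    """Return a copy of the environment with <prefix>/<libdir> first in LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    lib_path = os.path.join(prefix, libdir)  # assumption: libraries sit in <prefix>/lib
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_path + (os.pathsep + existing if existing else "")
    return env


# Example use (hypothetical binary name), selecting the gfx9 build at launch time:
#   import subprocess
#   subprocess.run(["./my_app"], env=env_with_rocblas("/usr/lib64/rocm/gfx9"))
```

The application binary is the same in every case; only the environment passed to the process changes.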
Hello, is this issue still being worked on?
Yes, we have ongoing work on this topic; when anything relevant lands in the develop branch I will comment here. Larger changes may not occur until a major release. gfx940 and gfx941 could be removed from the target list you build, IMO.
Just to keep you all in the loop, one possible solution has landed in develop commit bb81a83 in which we enable llvm clang's hip offload compiler option --offload-compress. See https://clang.llvm.org/docs/ClangCommandLineReference.html This may provide a significant (around 90%) size reduction with little impact on build time, so along with removing the gfx940 and gfx941 targets as suggested earlier, it will greatly reduce the linked .so size for develop branch. The compiler flag requires the ROCm 6.2 clang or later. We are still evaluating it, but if it passes all our requirements it should be used by default in a future release.
> Just to keep you all in the loop, one possible solution has landed in develop commit bb81a83 in which we enable llvm clang's hip offload compiler option --offload-compress. See https://clang.llvm.org/docs/ClangCommandLineReference.html This may provide a significant (around 90%) size reduction with little impact on build time, so along with removing the gfx940 and gfx941 targets as suggested earlier, it will greatly reduce the linked .so size for develop branch. The compiler flag requires the ROCm 6.2 clang or later. We are still evaluating it, but if it passes all our requirements it should be used by default in a future release.
Thanks for the suggestion, and it is simple to implement!
I did a number of experiments and the compression is no worse than 80%.
Another bit of info: DEBUG makes .hip_fatbin about 2x larger, and if you have no GPU debugger (Fedora doesn't yet) it doesn't make sense to have it.
rocSPARSE and rocSOLVER could also benefit from this option.
Here is what I did in Fedora
https://src.fedoraproject.org/rpms/rocblas/c/5aa46bcd5960478990cd692f9339afb4265050d3?branch=rawhide
The high compression ratio is not a big surprise given that there are really only 2 ISAs. I would expect all of gcn+cdna and all of rdna to each generate essentially the same code, aside from edge cases.
In terms of the original issue "relocation R_X86_64_PC32 out of range" the use of compiler option --offload-compress and reduction of targets mentioned should resolve this issue (compiler option landed in ROCm 6.2 but was new feature not used by rocBLAS). Offload compression will be used in rocBLAS for the upcoming release ROCm 6.3. Additional work will be ongoing for Tensile and rocBLAS to utilize new compiler features as they land with regard to the more general topic of fat binary size reductions. Thus I will close this issue, thanks to all for the discussion.
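For anyone building an older branch by hand, the two mitigations named above can be combined roughly as follows. This is a sketch under assumptions: `configure_args` is a hypothetical helper, passing `--offload-compress` through `CMAKE_CXX_FLAGS` is one plausible way to hand the flag to the compiler rather than the mechanism rocBLAS itself uses, and the flag requires the ROCm 6.2 clang or later as noted earlier in the thread.

```python
def configure_args(targets, drop=("gfx940", "gfx941"), compress=True):
    """Build a cmake argument list with a reduced target list and offload compression."""
    # Compare on the base name so that ':xnack+' / ':xnack-' variants are dropped too.
    kept = [t for t in targets if t.split(":", 1)[0] not in drop]
    args = [
        "cmake", "-G", "Ninja",
        "-DAMDGPU_TARGETS=" + ";".join(kept),
        "-DBUILD_WITH_TENSILE=ON",
    ]
    if compress:
        # Assumption: the flag is forwarded to clang via the generic CMake flags variable.
        args.append("-DCMAKE_CXX_FLAGS=--offload-compress")
    return args


targets = "gfx900;gfx906:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx940;gfx941;gfx942;gfx1030;gfx1100".split(";")
print(" ".join(configure_args(targets)))
```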