[Bug]: rocBLAS fails tests badly in FP16 for distro packages
littlewu2508 opened this issue · 9 comments
Describe the bug
Distro rocBLAS-5.6.0 (compiled with upstream llvm-16) fails many FP16-related tests. This is seen on both MI210 and Radeon VII. Details are in the gzipped test logs:
MI210-test.log.gz
RadeonVII-test.log.gz
The build log is also appended:
MI210-build.log.gz
RadeonVII-build.log.gz
@littlewu2508,
Could you add the missing information, such as the build log and environment.txt, so we can investigate the issue further?
Please refer to the bug template here.
To Reproduce
This result comes from running src_test in the Gentoo sci-libs/rocBLAS-5.6.0 package. Currently the package is in this test branch.
On a Gentoo system, you can replace the default repo with this experimental branch, then build and test rocBLAS:
cd /var/db/repos
mv gentoo{,.bak}
git clone -b rocm-5.6 https://github.com/littlewu2508/gentoo.git
echo 'ACCEPT_KEYWORDS="~amd64"' > /etc/portage/make.conf
mkdir -p /etc/portage/env /etc/portage/package.use
echo 'FEATURES=test' > /etc/portage/env/test.conf
echo 'sci-libs/rocBLAS test.conf' >> /etc/portage/package.env
emerge "=sci-libs/rocBLAS-5.6.0"
Expected behavior
All tests pass.
Log-files
The complete build and test logs are:
MI210-test.log.gz
MI210-build.log.gz
RadeonVII-build.log.gz
RadeonVII-test.log.gz
Environment
There are two environments
MI210
Hardware | description |
---|---|
CPU | AMD EPYC 7763 |
GPU | AMD Instinct MI210 |
Software | version |
---|---|
kernel | Debian 6.1.27-1 (2023-05-08) x86_64 |
llvm/clang | Gentoo 16.0.6 |
rocm-core | Gentoo rocm-5.6.0 |
rocblas | Gentoo rocm-5.6.0 |
Radeon VII
Hardware | description |
---|---|
CPU | AMD Ryzen 7 5800X |
GPU | AMD Radeon VII |
Software | version |
---|---|
kernel | Linux 6.3.2 |
llvm/clang | Gentoo 16.0.6 |
rocm-core | Gentoo rocm-5.6.0 |
rocblas | Gentoo rocm-5.6.0 |
@littlewu2508,
I tried to follow the steps you provided to reproduce the issue in a Gentoo environment, but I was unable to compile rocBLAS because of the following error:
(masked by: ~amd64 keyword)
I tried some steps to unmask it, but with no luck. I am not very familiar with the Gentoo environment. Any pointers on how to proceed?
I was not able to reproduce this issue using ROCm 5.6 on Ubuntu.
> @littlewu2508, I tried to follow the steps you provided to reproduce the issue in a Gentoo environment, but I was unable to compile rocBLAS because of the following error:
> (masked by: ~amd64 keyword)
Sorry, I made a mistake in the reproduction steps. Try adding ACCEPT_KEYWORDS="amd64" to echo 'ACCEPT_KEYWORDS="~amd64"' > /etc/portage/make.conf
> I tried some steps to unmask it, but with no luck. I am not very familiar with the Gentoo environment. Any pointers on how to proceed?
> I was not able to reproduce this issue using ROCm 5.6 on Ubuntu.
If you're using the official ROCm stack shipped by repo.radeon.com with an upstream kernel installed, you shouldn't encounter this issue. I could not reproduce it either on Debian 12 with the .deb packages from repo.radeon.com installed. So I guess it is Gentoo's use of upstream LLVM that causes the discrepancies.
@littlewu2508,
Thanks for the updated steps, I will try to reproduce. I had a discussion internally with the ROCm team, and we suspect an ABI mismatch is causing the half-precision tests to fail.
Would you be able to try some of the suggestions from the ROCm team in rocFFT issue #439?
To reproduce the error, you could use the sample program provided here in the Gentoo environment.
You could also try this suggestion to verify whether it resolves the issue.
> @littlewu2508, Thanks for the updated steps, I will try to reproduce. I had a discussion internally with the ROCm team, and we suspect an ABI mismatch is causing the half-precision tests to fail.
> Would you be able to try some of the suggestions from the ROCm team in rocFFT issue #439?
> To reproduce the error, you could use the sample program provided here in the Gentoo environment.
> You could also try this suggestion to verify whether it resolves the issue.
Thank you very much for these suggestions. I have also reproduced the float16.cpp issue; only -O3 generates sensible output. I will keep tracking ROCm/rocFFT#439.
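When judging whether a float16 reproducer prints "sensible" values, it can help to have reference numbers that do not depend on the C/C++ toolchain under suspicion. The following is a minimal sketch, assuming Python 3.6 or newer, that uses the standard `struct` module (its `e` format is IEEE 754 binary16) to compute the expected bit pattern and round-trip value of a float to half conversion. The helper names `to_half_bits` and `half_round_trip` are illustrative and are not taken from the rocFFT issue or from float16.cpp.

```python
import struct


def to_half_bits(x: float) -> int:
    """Return the IEEE 754 binary16 bit pattern of x as an integer.

    Note: Python floats are doubles, so this rounds double -> half directly.
    For values exactly representable as a C float this matches a
    float -> half cast; treat it as a reference, not a bit-exact oracle
    for every possible input.
    """
    return struct.unpack("<H", struct.pack("<e", x))[0]


def half_round_trip(x: float) -> float:
    """Convert x to binary16 and back, like a float -> half -> float cast."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


if __name__ == "__main__":
    # Print reference values to compare by eye against a reproducer's output.
    for value in (1.0, -2.0, 0.1, 65504.0):
        bits = to_half_bits(value)
        print(f"{value!r:>8} -> 0x{bits:04X} -> {half_round_trip(value)!r}")
```

If a compiled reproducer prints values far from these at some optimization levels but not at others, that is at least consistent with a problem in the runtime conversion path rather than in the arithmetic itself.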
@littlewu2508, the Fedora fix for half precision is below:
https://src.fedoraproject.org/fork/tstellar/rpms/compiler-rt/blob/0459cbc5d9eb15f1ad51d74707b4988049183708/f/0001-compiler-rt-Fix-FLOAT16-feature-detection.patch
> @littlewu2508, the Fedora fix for half precision is below: https://src.fedoraproject.org/fork/tstellar/rpms/compiler-rt/blob/0459cbc5d9eb15f1ad51d74707b4988049183708/f/0001-compiler-rt-Fix-FLOAT16-feature-detection.patch
Thank you! Is this patch submitted to llvm-project upstream?
@littlewu2508,
Do you still need any assistance from rocBLAS? If not, please feel free to close this ticket.