rocblas thread safety and level3 initialization overhead
jakub-homola opened this issue · 5 comments
Is rocblas thread safe? (or maybe rather "thread-aware"?)
Assuming I have different rocblas_handle
s, different streams for each of them, different data, but the same gpu. I did not find anything about it in the documentation.
I am having a strange issue with initialization, when I call a level3 rocblas function (syrk in my use case) for the first time in an OpenMP parallel region, the initialization seems to occur in each thread and the rocblas calls seem sequentialized, significantly slowing down the program. I use rocm-5.4.3, I don't have the ability to try newer.
I could elaborate more on the issue, but first I wanted to ask if it is even valid use, to call rocblas functions from different host threads concurrently.
Hello @jakub-homola,
Thank you for reporting this issue. When you meant:
the initialization seems to occur in each thread
I suppose that you are taking about the call to rocblas_initialize(). The rocblas_initialize() call does not have to occur in each thread and it has to be called once outside the OpenMP parallel region.
Even after this change if you are still facing the performance drop, feel free to post the example code here and I will take a look at it.
Let me know if you have further questions.
Naveen
I don't explicitly call roclbas_initialize()
in a parallel region. Here is the issue:
Now that I went throroughly through it, it might just be a problem with the installation I am using. But I cannot be sure as I don't have access to any different AMD GPU machine with the same rocm version. But I will report it anyway, I would appreciate if you could try and replicate it on 5.4.3 too to see if it's just a problem with my installation.
Describe the bug
If the first rocblas_syrk
function call occurs from several threads concurrently, the time of the syrk call rises unexpectedly.
To Reproduce
Precise version of rocBLAS installed: the rocblas that comes with rocm-5.4.3. In rocblas-version.h I see version 2.46
Steps to reproduce the behavior:
Use the source.hip.cpp.txt file (remove the .txt extension). There, in the loop on lines 61-82 I create several streams, rocblas handles, allocate and initialize matrices. I also query for the workspace buffer size needed by the syrk function, and allocate and assign the memory to rocblas, to ensure asynchronous submit of the compute stage. In the next loop starting on line 87, I submit all the rocblas_dsyrk
kernels and measure the time it took to submit. Finally I free the memory and destroy the handles and streams. All this is repeated 3 times.
Compile using hipcc -g -O2 -fopenmp --offload-arch=gfx90a:sramecc+:xnack- source.hip.cpp -o program.x -lrocblas
and run.
(1) This was the sequential program that works as expected. See output_seq.txt. The first ever call to (the compute stage of) rocblas_dsyrk()
takes a loger time (around 400 ms) due to some initializations, the rest of the calls are very quick, as they just submit the work to the gpu. The first repeat takes around 800 ms due to all the initializations, the remaining ones take only around 70 ms.
(2) Now, the (probably) buggy part. Uncomment the #pragma omp parallel
on line 86 to make the loop parallel. The output I get is output_par_08.txt. There you can see, that in the first repeat, the rocblas_dsyrk
calls take significantly longer amount of time, which significantly increases the total time of that repetition to over 2.2 s. From the timings it seems like if the 8 threads that run the first syrk concurrently did some initialization, which got sequentialized, as the syrk calls finish with approximately uniform gaps. The remaining two repetitions run in ~70 ms as expected, since everything is already initialized. Running the loop with 16 threads instead of 8 makes things even worse: output_par_16.txt
(3) I can work around this problem by calling the syrk from only one thread at a time, thus introducing the #pragma omp critical
on line 93. The output then looks like this output_par_08_critical.txt. The first thread that encounters the syrk does the initializations. The running threads must wait for that first syrk submit to finish, so it takes ~400 ms for the first batch of syrks, but I am okay with that. The repetition times are approx. equal to the original sequential version.
(4) If I remove the critical region and keep the loop parallel (as in (2)), but now call rocblas_initialize()
at the start of the program, the issue now also goes away and it seems to work as expected: output_par_init.txt. This again suggests that there is some issue with initialization being called in parallel. My issue with this is, that it takes way too long to initialize everything in rocblas so that it is not worth it for me.
This is just a demonstration, my real program is way more complicated, and being able to submit the syrk kernels in parallel allows me to do more cpu-gpu overlap.
Expected behavior
Similar to (3) with the critical, but where I don't have to use the critical region, as it should be somewhere on the inside of rocblas.
Log-files
Running (2) with ROCBLAS_LAYER=7
: log.txt
Environment
LUMI supercomputer, accelerated compute node. AMD EPYC 7A53 64-Core Processor, AMD MI250X.
I am using rocm-5.4.3 manually installed by a colleague. With the officialy supported rocm-5.2.3 ... I just found out that it works just fine, and the initialization times are tiny. Okay, this might just be a problem with the installation, but I have no idea how to find out. Maybe the update made instruduced the issue. Could you please try to reproduce this?
With rocm/5.4.3 I see rocblas 2.46
With rocm/5.2.3 I see rocblas 2.44
environment.txt
lspci: command not found
Additional context
As I am now trying to experiment with different, but still not officially supported rocm versions, the core of the issue seems to continue even with rocm/5.7.1, although the timings are different now.
@jakub-homola ,
The thread safety is implemented using locks, hence you are noticing the high initialization time and sequential behavior for the first run. The initialization times have decreased in the latest versions of ROCm, but would still exhibit the same behavior as other threads must wait for the first thread to complete the initialization.
Summary of initialization times:
Threads # | ROCm Ver | Initialization time |
---|---|---|
8 | 5.4.3 | 2220 ms |
8 | 6.1.1 | 838 ms |
16 | 5.4.3 | 3725 ms |
16 | 6.1.1 | 1142 ms |
output_par_8.txt
output_par_16.txt
If the above initialization time seems reasonable for your use case, you could upgrade the ROCm version to 6.1.1.
@rkamd
Thanks for the reply.
I understand that the thread safety is implemented using locks, that seems reasonable.
The issue was, that if I use the locks myself instead of relying on the locks inside rocblas (that is the difference between (2) and (3)), then the runtimes are much more reasonable -- using the rocblas's locks had weird behavior, it seemed like the initialization happened again in each thread sequentially, instead of only in one.
Looking at the outputs you gathered, they look much more reasonable now with the new ROCm versions.
I would like to update to 6.1.1, unfortunately this is not up to me, but up to LUMI admins, which are appearently extremely conservative with updates.
I will test it when the new version will be available for me.
Anyway, according to your outputs, my main issue seems to be solved, so I will close this. Will reopen if necessary.
Thanks.
I'm not a rocBLAS/Tensile developer, however, I believe it is possible to make dramatic improvements to rocBLAS/Tensile initialization times, especially for cases like the one you've described. The scope of work required, however, is quite significant. If the initialization performance in ROCm 6.1 is still a pain point, please don't be afraid to say so. User feedback is important for prioritizing optimizations.