[BUG] Extremely long wall clock time when using the marker API in OpenMP code
tdollenb opened this issue · 9 comments
When using the Marker API together with OpenMP I observed huge runtimes with `likwid-perfctr`. I was able to reproduce the issue with the tutorial on markers in C: the tutorial code runs 40 times longer on my system when using `likwid-perfctr -m` compared to running without `-m`. The measurements themselves seem correct, but the application appears to be halted a lot.
- System: Intel SPR (Xeon(R) Platinum 8468), 2x 48 cores
- OS: Linux, Rocky Linux 8.10
- LIKWID version: 5.3.0 with additional bugfixes up to commit d8fea29, deployed with EasyBuild
- Run command: `likwid-perfctr -g BRANCH -C 0-95 -m ./a.out` (the code is the same as in the marker tutorial)
- Compiled with: Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320)
- Compilation command: `icx -fiopenmp -O3 -DLIKWID_PERFMON -llikwid marker_tutorial.c`

Output with `-m`:
likwid_dump.txt
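For reference, a minimal sketch of the structure I compiled (not the exact tutorial source; the region name and loop body are placeholders, and I assume the `likwid-marker.h` header of LIKWID 5.x):

```c
#include <stdio.h>
#include <omp.h>
#include <likwid-marker.h>

int main(void)
{
    LIKWID_MARKER_INIT;

    #pragma omp parallel
    {
        double sum = 0.0;
        LIKWID_MARKER_START("example");
        // placeholder per-thread work so there is something to measure
        for (int i = 0; i < 100000000; i++)
        {
            sum += 1.0 / (i + 1);
        }
        LIKWID_MARKER_STOP("example");
        printf("Thread %d: sum = %f\n", omp_get_thread_num(), sum);
    }

    LIKWID_MARKER_CLOSE;
    return 0;
}
```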
The code on the tutorial page is just a demo of how the code structure should look when adding the Marker API. It probably runs too short for anything helpful. According to your output, the OpenMP threads had an average runtime of 0.0007 s.

Most of the runtime is spent setting up the counting infrastructure at `LIKWID_MARKER_REGISTER` (or at the first `LIKWID_MARKER_START` if no `LIKWID_MARKER_REGISTER` was done before) for each OpenMP thread. This happens only once, so applications with significant runtime should not be affected much. Your selected `ACCESSMODE=accessdaemon` has the largest startup time.
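In code, that one-time cost ends up at the registration (or, if you omit it, at each thread's first START), so it can be kept out of the measured part; a rough sketch:

```c
LIKWID_MARKER_INIT;

// One-time per-thread setup: each OpenMP thread starts its own
// access daemon here (or at its first LIKWID_MARKER_START if the
// REGISTER call is omitted).
#pragma omp parallel
{
    LIKWID_MARKER_REGISTER("Compute");
}

// Measured part: START/STOP now only program and read the counters,
// the daemon startup cost was already paid above.
#pragma omp parallel
{
    LIKWID_MARKER_START("Compute");
    /* ... kernel ... */
    LIKWID_MARKER_STOP("Compute");
}

LIKWID_MARKER_CLOSE;
```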
I used the tutorial application to reproduce it with something small, but I can see the issue with larger, long-running applications as well. The sparse matrix-vector multiplication code I originally observed it with is below. I tried to use `LIKWID_MARKER_REGISTER` as well, but it had no effect on the runtime:
```c
LIKWID_MARKER_INIT;

#pragma omp parallel num_threads(tOptions->iNumThreads)
{
    LIKWID_MARKER_THREADINIT;
    LIKWID_MARKER_REGISTER("Compute");
}

// take the time: start
t1 = omp_get_wtime();

#pragma omp parallel num_threads(tOptions->iNumThreads)
{
    LIKWID_MARKER_START("Compute");

    // run kernel iNumRepetitions times
    for (int rep = 0; rep < iNumRepetitions; rep++)
    {
        timings[rep].dBegin = omp_get_wtime();

        // yAx kernel
        #pragma omp for schedule(static, CHUNK_SIZE)
        for (int i = 0; i < tInput->stNumRows; i++)
        {
            double sum = 0.0;
            int rowbeg = Arow[i];
            int rowend = Arow[i+1];
            for (int nz = rowbeg; nz < rowend; nz++)
            {
                sum += Aval[nz] * x[ Acol[nz] ];
            }
            y[i] = sum;
        }

        timings[rep].dEnd = omp_get_wtime();
    }

    LIKWID_MARKER_STOP("Compute");
}

// take the time: end
t2 = omp_get_wtime();

LIKWID_MARKER_CLOSE;
```
It still runs significantly longer with `-m` than without.

Output with `-m` (wall clock time 7 minutes):
likwid_dump.txt

Output without `-m` (wall clock time 47 seconds):
likwid_dump.txt
What does your internal timing tell you (`t2 - t1`)? If it is close to the runtime without markers, the time is spent in the REGISTER call. If you remove the register call, the same operations happen at the first call of START, so you do not gain anything; you can just move it out of the performance-critical part.
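If in doubt, you can time the registration region on its own, e.g. (a sketch reusing the variables from your snippet; `t_reg0`/`t_reg1` are just local helpers):

```c
double t_reg0 = omp_get_wtime();
#pragma omp parallel num_threads(tOptions->iNumThreads)
{
    LIKWID_MARKER_REGISTER("Compute");  // per-thread accessdaemon startup happens here
}
double t_reg1 = omp_get_wtime();
printf("Marker registration: %.3f s\n", t_reg1 - t_reg0);
```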
Each thread starts up an accessdaemon, which has to perform quite heavy operations on SapphireRapids systems to get the list of counters. This operation was made more lightweight in the `master` branch, but it is still quite heavy. The startup of the accessdaemon, the setup of the UNIX domain socket between library and daemon, etc. often cause some kind of serialization. So although the startup could be done in parallel, it is not, and the more threads you use, the more waiting time there is in the beginning.
But yes, the increase in wall time seems extreme in your case.
P.S. Please remove `LIKWID_MARKER_THREADINIT`. It is not needed in most cases; it is only required if you use an unsupported threading solution where the thread pinning failed.
Yes, the runtime is close to the reported runtime, so the actual measurements seem to be correct. The REGISTER call takes 233 seconds for my 96 threads, and it seems to scale with the number of threads used.
I am not sure which operation of the access daemon causes this serialization effect; it already seems to be the starting of the processes itself. There is nothing I can do about it. You are not the first to report issues with Rocky Linux; maybe it uses some permissive settings. We do not operate Rocky Linux in our center.
P.S. You do not see it in the non-MarkerAPI case because in this mode only one access daemon is started to cover all threads. Only for the MarkerAPI, one access daemon per thread is required to allow concurrent accesses to the counters.
Thank you for clarifying this. Since the measurements are correct regardless, this is just a convenience issue for me.
The time should reduce a lot when using the `master` branch or the next release.
A comment on the access code: is it correct that this malloc could cause a memory leak, since `LIKWID_MARKER_REGISTER` can be performed by multiple threads? It does not seem to be protected by a mutex, and multiple threads would overwrite the `cpuLocks` static variable.
Yes and no. LIKWID starts one accessdaemon in the main thread at `LIKWID_MARKER_INIT`, which initializes `cpuSockets`, `daemon_pids` and `cpuLocks`. Consequently, the checks for all these structures evaluate to false for all accessdaemons started in `LIKWID_MARKER_REGISTER` or `LIKWID_MARKER_START`. And you cannot call `LIKWID_MARKER_REGISTER` or `LIKWID_MARKER_START` directly without that initialization; the library will complain that there was no `LIKWID_MARKER_INIT` and return.
So yes, a potential memory leak, but in general it is avoided by the logic. You are right, though: it should be guarded with a lock to avoid pitfalls in the future.
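Something along these lines would do it; a generic sketch of the pattern (not the actual access-layer code, names are illustrative):

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static int *cpu_locks = NULL;  /* illustrative stand-in for cpuLocks */

/* Allocate the per-CPU lock array exactly once, even if several
 * threads reach this point concurrently. */
static int ensure_cpu_locks(int num_cpus)
{
    pthread_mutex_lock(&init_lock);
    if (cpu_locks == NULL)
    {
        cpu_locks = calloc(num_cpus, sizeof(*cpu_locks));
        if (cpu_locks == NULL)
        {
            pthread_mutex_unlock(&init_lock);
            return -1;
        }
    }
    pthread_mutex_unlock(&init_lock);
    return 0;
}
```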
If you are searching for what causes the long startup times: it is my implementation of Intel's Uncore Discovery Mechanism, introduced with SapphireRapids.