[BUG] Extremely long wall clock time when using the marker API in OpenMP code
tdollenb opened this issue · 9 comments
When using the Marker API together with OpenMP I observed huge runtimes with `likwid-perfctr`. I was able to reproduce the issue with the tutorial on markers in C: the tutorial code runs 40 times longer on my system when using `likwid-perfctr -m` compared to running without `-m`. The measurements themselves seem correct, but the application appears to be halted a lot.
- System: Intel SPR (Xeon(R) Platinum 8468), 2x 48 cores
- OS: Linux, Rocky Linux 8.10
- LIKWID version: 5.3.0 with additional bugfixes up to commit d8fea29, deployed with EasyBuild
- Run command: `likwid-perfctr -g BRANCH -C 0-95 -m ./a.out` (the code is the same as in the marker tutorial)
- Compiled with: Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320)
- Compilation command: `icx -fiopenmp -O3 -DLIKWID_PERFMON -llikwid marker_tutorial.c`

Output with `-m`:
likwid_dump.txt
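For reference, a minimal sketch of the structure I compiled (not the exact tutorial source; the region name and loop body are placeholders, and I assume the `likwid-marker.h` header of LIKWID 5.x):

```c
#include <stdio.h>
#include <omp.h>
#include <likwid-marker.h>

int main(void)
{
    LIKWID_MARKER_INIT;

    #pragma omp parallel
    {
        double sum = 0.0;
        LIKWID_MARKER_START("example");
        // placeholder per-thread work so there is something to measure
        for (int i = 0; i < 100000000; i++)
        {
            sum += 1.0 / (i + 1);
        }
        LIKWID_MARKER_STOP("example");
        printf("Thread %d: sum = %f\n", omp_get_thread_num(), sum);
    }

    LIKWID_MARKER_CLOSE;
    return 0;
}
```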
The code on the tutorial page is just a demo of how the code structure should look when adding the Marker API. It probably runs too short for anything helpful. According to your output, the OpenMP threads had an average runtime of 0.0007 s.

Most of the runtime is spent setting up the counting infrastructure at `LIKWID_MARKER_REGISTER` (or at the first `LIKWID_MARKER_START` if no `LIKWID_MARKER_REGISTER` was done before) for each OpenMP thread. This happens only once, so applications with significant runtime should not be affected much. Your selected `ACCESSMODE=accessdaemon` has the largest startup time.
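In code, that one-time cost ends up at the registration (or, if you omit it, at each thread's first START), so it can be kept out of the measured part; a rough sketch:

```c
LIKWID_MARKER_INIT;

// One-time per-thread setup: each OpenMP thread starts its own
// access daemon here (or at its first LIKWID_MARKER_START if the
// REGISTER call is omitted).
#pragma omp parallel
{
    LIKWID_MARKER_REGISTER("Compute");
}

// Measured part: START/STOP now only program and read the counters,
// the daemon startup cost was already paid above.
#pragma omp parallel
{
    LIKWID_MARKER_START("Compute");
    /* ... kernel ... */
    LIKWID_MARKER_STOP("Compute");
}

LIKWID_MARKER_CLOSE;
```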
I used the tutorial application to reproduce it with something small, but I can see the issue with larger, long-running applications as well. The sparse matrix-vector multiplication code I originally observed it with is below. I tried to use `LIKWID_MARKER_REGISTER` as well, but it had no effect on the runtime:
```c
LIKWID_MARKER_INIT;

#pragma omp parallel num_threads(tOptions->iNumThreads)
{
    LIKWID_MARKER_THREADINIT;
    LIKWID_MARKER_REGISTER("Compute");
}

// take the time: start
t1 = omp_get_wtime();

#pragma omp parallel num_threads(tOptions->iNumThreads)
{
    LIKWID_MARKER_START("Compute");

    // run kernel iNumRepetitions times
    for (int rep = 0; rep < iNumRepetitions; rep++)
    {
        timings[rep].dBegin = omp_get_wtime();

        // yAx kernel
        #pragma omp for schedule(static, CHUNK_SIZE)
        for (int i = 0; i < tInput->stNumRows; i++)
        {
            double sum = 0.0;
            int rowbeg = Arow[i];
            int rowend = Arow[i+1];
            for (int nz = rowbeg; nz < rowend; nz++)
            {
                sum += Aval[nz] * x[ Acol[nz] ];
            }
            y[i] = sum;
        }

        timings[rep].dEnd = omp_get_wtime();
    }

    LIKWID_MARKER_STOP("Compute");
}

// take the time: end
t2 = omp_get_wtime();

LIKWID_MARKER_CLOSE;
```
It still runs significantly longer with `-m` than without.

Output with `-m` (wall clock time 7 minutes):
likwid_dump.txt

Output without `-m` (wall clock time 47 seconds):
likwid_dump.txt
What does your internal timing tell you (`t2 - t1`)? If it is close to the runtime without markers, the time is spent in the REGISTER call. If you remove the register call, the same operations happen at the first call of START, so you do not gain anything; you can just move it out of the performance-critical part.
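If in doubt, you can time the registration region on its own, e.g. (a sketch reusing the variables from your snippet; `t_reg0`/`t_reg1` are just local helpers):

```c
double t_reg0 = omp_get_wtime();
#pragma omp parallel num_threads(tOptions->iNumThreads)
{
    LIKWID_MARKER_REGISTER("Compute");  // per-thread accessdaemon startup happens here
}
double t_reg1 = omp_get_wtime();
printf("Marker registration: %.3f s\n", t_reg1 - t_reg0);
```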
Each thread starts up an accessdaemon, which has to perform quite heavy operations on SapphireRapids systems to get the list of counters. This operation was made more lightweight in the `master` branch, but it is still quite heavy. The startup of the accessdaemon, the setup of the UNIX domain socket between library and daemon, etc. often cause some kind of serialization. So although the startup could be done in parallel, it is not, and the more threads you use, the more waiting time there is in the beginning.
But yes, the increase in wall time seems extreme in your case.
P.S. Please remove `LIKWID_MARKER_THREADINIT`. It is not needed in most cases; it is only required if you use an unsupported threading solution where the thread pinning failed.
Yes, the runtime is close to the reported runtime, so the actual measurements seem to be correct. The REGISTER call takes 233 seconds for my 96 threads, and it seems to scale with the number of threads used.
I am not sure which operation of the access daemon causes this serialization effect; it already seems to be the starting of the processes itself. There is nothing I can do about it. You are not the first to report issues with Rocky Linux; maybe it uses some permissive settings. We do not operate Rocky Linux in our center.
P.S. You do not see it in the non-MarkerAPI case because in this mode only one access daemon is started to cover all threads. Only for the MarkerAPI, one access daemon per thread is required to allow concurrent accesses to the counters.
Thank you for clarifying this. Since the measurements are correct regardless, this is just a convenience issue for me.
The time should reduce a lot when using the `master` branch or the next release.
A comment on the access code: is it correct that this malloc could cause a memory leak, since `LIKWID_MARKER_REGISTER` can be performed by multiple threads? It does not seem to be protected by a mutex, and multiple threads would overwrite the `cpuLocks` static variable.
Yes and no. LIKWID starts one accessdaemon in the main thread at `LIKWID_MARKER_INIT`, which initializes `cpuSockets`, `daemon_pids` and `cpuLocks`. Consequently, the checks for all these structures evaluate to false for all accessdaemons started in `LIKWID_MARKER_REGISTER` or `LIKWID_MARKER_START`. And you cannot call `LIKWID_MARKER_REGISTER` or `LIKWID_MARKER_START` directly without that initialization; the library will complain that there was no `LIKWID_MARKER_INIT` and return.
So yes, a potential memory leak, but in general it is avoided by the logic. You are right, though: it should be guarded with a lock to avoid pitfalls in the future.
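Something along these lines would do it; a generic sketch of the pattern (not the actual access-layer code, names are illustrative):

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static int *cpu_locks = NULL;  /* illustrative stand-in for cpuLocks */

/* Allocate the per-CPU lock array exactly once, even if several
 * threads reach this point concurrently. */
static int ensure_cpu_locks(int num_cpus)
{
    pthread_mutex_lock(&init_lock);
    if (cpu_locks == NULL)
    {
        cpu_locks = calloc(num_cpus, sizeof(*cpu_locks));
        if (cpu_locks == NULL)
        {
            pthread_mutex_unlock(&init_lock);
            return -1;
        }
    }
    pthread_mutex_unlock(&init_lock);
    return 0;
}
```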
If you are searching for what causes the long startup times: it is my implementation of Intel's Uncore Discovery Mechanism, introduced with SapphireRapids.