[BUG] Silent failure on multi-threaded runs
ivan-pi opened this issue · 5 comments
Describe the bug
likwid-pin appears to fail silently when run with more than one thread: the command exits almost immediately and nothing is written to standard output.
To Reproduce
- LIKWID command and/or API usage:
  $ likwid-pin -V 2 -c 0,1 ./albm
- LIKWID version and download source (GitHub, FTP, package manager, ...):
  likwid-pin -- Version 5.3.0 (commit: 0123456789)
- Operating system:
  Linux maxwell 5.15.0-100-generic #110~20.04.1-Ubuntu SMP
- Does your application use libraries like MPI, OpenMP or Pthreads? Yes, OpenMP.
- Are you using the MarkerAPI (CPU code instrumentation)? No.
To Reproduce with a LIKWID command
Please supply the output of the command with -V 3 added to the command:
(base) ivan@maxwell:~/lrz/rbfxlbm/build$ likwid-pin -V 3 -c 0,1 ./albm
DEBUG - [hwloc_init_cpuInfo:359] HWLOC CpuInfo Family 6 Model 167 Stepping 1 Vendor 0x0 Part 0x0 isIntel 1 numHWThreads 16 activeHWThreads 16
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 8 Thread 1 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 9 Thread 1 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 2 Thread 0 Core 2 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 10 Thread 1 Core 2 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 3 Thread 0 Core 3 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 11 Thread 1 Core 3 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 4 Thread 0 Core 4 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 12 Thread 1 Core 4 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 5 Thread 0 Core 5 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 13 Thread 1 Core 5 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 6 Thread 0 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 14 Thread 1 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 7 Thread 0 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 15 Thread 1 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 0 Level 1 Size 49152 Threads 2
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 1 Level 2 Size 524288 Threads 2
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 2 Level 3 Size 16777216 Threads 16
DEBUG - [affinity_init:547] Affinity: Socket domains 1
DEBUG - [affinity_init:549] Affinity: CPU die domains 1
DEBUG - [affinity_init:554] Affinity: CPU cores per LLC 8
DEBUG - [affinity_init:557] Affinity: Cache domains 1
DEBUG - [affinity_init:561] Affinity: NUMA domains 1
DEBUG - [affinity_init:562] Affinity: All domains 5
DEBUG - [affinity_addNodeDomain:370] Affinity domain N: 16 HW threads on 8 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S0: 16 HW threads on 8 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D0: 16 HW threads on 8 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 16 HW threads on 8 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M0: 16 HW threads on 8 cores
DEBUG - [create_lookups:290] T 0 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 1 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 2 T2C 2 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 3 T2C 3 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 4 T2C 4 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 5 T2C 5 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 6 T2C 6 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 7 T2C 7 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 8 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 9 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 10 T2C 2 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 11 T2C 3 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 12 T2C 4 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 13 T2C 5 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 14 T2C 6 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 15 T2C 7 T2S 0 T2D 0 T2LLC 0 T2M 0
Evaluated CPU string to CPUs: 0,1
Running: ./albm
Using 2 thread(s) (cpuset: 0x3)
In contrast, with a single thread I get:
...
Evaluated CPU string to CPUs: 0
[likwid-pin] Main PID -> hwthread 0 - OK
Running: ./albm
Using 1 thread(s) (cpuset: 0x1)
num_steps = 1000
tau / dt ratio = 2.0000000E-02
CFL = 0.6270693
U0 = 1.1547005E-02
Mach = 2.0000000E-02
Re = 1000.000
Everything okay
51486 1081185
In assembly routine:
n = 51485
nnz = 1081185
rownnz_max = 21
rhs_max = 9
Attempting to allocate memory
n = 51485 , nz = 21 , q = 9
sysclock (s) 3.43853497505188
mlups 14.9729455041758
ompwtime (s) 3.43853306770325
mlups 14.9729538096440
Total time (s) 3.43853306770325
Collision time ratio 1.559326410374301E-002
Streaming time ratio 0.984065665745844
If I run the application directly, it works as expected:
(base) ivan@maxwell:~/lrz/rbfxlbm/build$ OMP_NUM_THREADS=2 ./albm
num_steps = 1000
tau / dt ratio = 2.0000000E-02
CFL = 0.6270693
U0 = 1.1547005E-02
Mach = 2.0000000E-02
Re = 1000.000
Everything okay
51486 1081185
In assembly routine:
n = 51485
nnz = 1081185
rownnz_max = 21
rhs_max = 9
Attempting to allocate memory
n = 51485 , nz = 21 , q = 9
sysclock (s) 1.81032705307007
mlups 28.4396107920625
ompwtime (s) 1.81032490730286
mlups 28.4396445013620
Total time (s) 1.81032490730286
Collision time ratio 1.993282543925349E-002
Streaming time ratio 0.979440022346742
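Because the two-thread run prints nothing, it may help to record the exit status to tell a crash from a clean early exit. The following is a minimal sketch; `run_and_report` is a hypothetical helper name, and whether likwid-pin forwards the application's exit status is an assumption that should be checked.

```shell
# Hypothetical helper: run a command and report its exit status on stderr.
# By shell convention, a status above 128 usually means the process was
# terminated by signal number (status - 128), e.g. 139 for SIGSEGV (11).
run_and_report() {
    "$@"
    status=$?
    if [ "$status" -gt 128 ]; then
        echo "exit status $status (signal $((status - 128)))" >&2
    else
        echo "exit status $status" >&2
    fi
    return "$status"
}
```

Usage would look like `run_and_report likwid-pin -c 0,1 ./albm`, with the report going to standard error so it does not mix with the application output.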
Thanks for reporting. I have never seen such behavior.
Does it work with other applications and multiple threads? Are you using some computing library like TBB, Cilk+, SYCL, ...? If it is OpenMP, is it one of the common implementations (GCC, LLVM, Intel)?
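One way to answer the runtime question is to look at the shared libraries the binary links against. The sketch below filters `ldd` output for common threading runtimes; `filter_threading_libs` is a hypothetical helper name, and the pattern covers only the usual library names (libgomp for GCC, libiomp5 for Intel, libomp for LLVM, libtbb, and the MKL threading layers). Statically linked runtimes, or a threading layer chosen at run time, would not show up here.

```shell
# Hypothetical helper: read `ldd` output on stdin and print the names of
# well-known threading runtime libraries found in the first column.
filter_threading_libs() {
    awk '{ print $1 }' \
        | grep -E '^lib(gomp|iomp5|omp|tbb|mkl_[a-z]+_thread)' \
        | sort -u
}
```

Usage would look like `ldd ./albm | filter_threading_libs`.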
No response? I will close the issue soon.
I have only tested with the GCC and Intel compilers. TBB might be pulled in via MKL Sparse BLAS, but I'd need to double-check this. I'll try again with a simpler application.
Thanks for your response. If you used OpenMP (GCC or Intel), we should try to find the error. My question regarding threading solutions like TBB, Cilk+ or SYCL was just to ensure we are not talking about something exotic.
Can you please try the following with your failing code:
# Rebuild LIKWID with DEBUG=true
$ cd likwid-src
$ make distclean
$ make PREFIX=$LIKWID_INSTALL_DIR DEBUG=true
$ make PREFIX=$LIKWID_INSTALL_DIR DEBUG=true install
$ gdb $LIKWID_INSTALL_DIR/bin/likwid-lua
gdb > run $LIKWID_INSTALL_DIR/bin/likwid-pin -V 3 -c 0,1 ./albm
<fails>
gdb > backtrace
With this, I should be able to locate the exact error.
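The interactive gdb session above can also be run non-interactively, which makes the output easier to paste into the issue. This is a sketch only: `likwid_pin_backtrace` is a hypothetical wrapper name, and it assumes `LIKWID_INSTALL_DIR` points at the DEBUG=true build from the steps above. If the program exits without crashing, gdb has no stack to show and the backtrace step will report that instead.

```shell
# Hypothetical wrapper: run likwid-pin under gdb in batch mode, then print
# a backtrace. All arguments are passed through to likwid-pin unchanged.
likwid_pin_backtrace() {
    gdb -batch -ex run -ex backtrace --args \
        "$LIKWID_INSTALL_DIR/bin/likwid-lua" \
        "$LIKWID_INSTALL_DIR/bin/likwid-pin" "$@"
}
```

Usage would look like `likwid_pin_backtrace -V 3 -c 0,1 ./albm 2>&1 | tee backtrace.log`.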