RRZE-HPC/likwid

[BUG] Silent failure on multi-threaded runs

ivan-pi opened this issue · 5 comments

Describe the bug

likwid-pin appears to fail silently when run with more than one thread: the command exits almost immediately and nothing is written to standard output.

To Reproduce

  • LIKWID command and/or API usage: $ likwid-pin -V 2 -c 0,1 ./albm

  • LIKWID version and download source (Github, FTP, package manager, ...): likwid-pin -- Version 5.3.0 (commit: 0123456789)

  • Operating system: Linux maxwell 5.15.0-100-generic #110~20.04.1-Ubuntu SMP

  • Does your application use libraries like MPI, OpenMP or Pthreads? Yes, OpenMP.

  • Are you using the MarkerAPI (CPU code instrumentation)? No.

To Reproduce with a LIKWID command

Please supply the output of the command with -V 3 added to the command:

(base) ivan@maxwell:~/lrz/rbfxlbm/build$ likwid-pin -V 3 -c 0,1 ./albm
DEBUG - [hwloc_init_cpuInfo:359] HWLOC CpuInfo Family 6 Model 167 Stepping 1 Vendor 0x0 Part 0x0 isIntel 1 numHWThreads 16 activeHWThreads 16
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 0 Thread 0 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 8 Thread 1 Core 0 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 1 Thread 0 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 9 Thread 1 Core 1 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 2 Thread 0 Core 2 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 10 Thread 1 Core 2 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 3 Thread 0 Core 3 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 11 Thread 1 Core 3 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 4 Thread 0 Core 4 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 12 Thread 1 Core 4 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 5 Thread 0 Core 5 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 13 Thread 1 Core 5 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 6 Thread 0 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 14 Thread 1 Core 6 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 7 Thread 0 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_nodeTopology:568] HWLOC Thread Pool PU 15 Thread 1 Core 7 Die 0 Socket 0 inCpuSet 1
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 0 Level 1 Size 49152 Threads 2
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 1 Level 2 Size 524288 Threads 2
DEBUG - [hwloc_init_cacheTopology:798] HWLOC Cache Pool ID 2 Level 3 Size 16777216 Threads 16
DEBUG - [affinity_init:547] Affinity: Socket domains 1
DEBUG - [affinity_init:549] Affinity: CPU die domains 1
DEBUG - [affinity_init:554] Affinity: CPU cores per LLC 8
DEBUG - [affinity_init:557] Affinity: Cache domains 1
DEBUG - [affinity_init:561] Affinity: NUMA domains 1
DEBUG - [affinity_init:562] Affinity: All domains 5
DEBUG - [affinity_addNodeDomain:370] Affinity domain N: 16 HW threads on 8 cores
DEBUG - [affinity_addSocketDomain:401] Affinity domain S0: 16 HW threads on 8 cores
DEBUG - [affinity_addDieDomain:438] Affinity domain D0: 16 HW threads on 8 cores
DEBUG - [affinity_addCacheDomain:474] Affinity domain C0: 16 HW threads on 8 cores
DEBUG - [affinity_addMemoryDomain:504] Affinity domain M0: 16 HW threads on 8 cores
DEBUG - [create_lookups:290] T 0 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 1 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 2 T2C 2 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 3 T2C 3 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 4 T2C 4 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 5 T2C 5 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 6 T2C 6 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 7 T2C 7 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 8 T2C 0 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 9 T2C 1 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 10 T2C 2 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 11 T2C 3 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 12 T2C 4 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 13 T2C 5 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 14 T2C 6 T2S 0 T2D 0 T2LLC 0 T2M 0
DEBUG - [create_lookups:290] T 15 T2C 7 T2S 0 T2D 0 T2LLC 0 T2M 0
Evaluated CPU string to CPUs: 0,1
Running: ./albm
Using 2 thread(s) (cpuset: 0x3)
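
For readers following the log: the cpuset value in the last line looks like a plain bitmask of the selected hardware threads, with bit i set for CPU i. This is an assumption based only on the two values that appear in this report (0x3 for CPUs 0,1 and 0x1 for CPU 0), not on the LIKWID sources. The helper name `cpuset_mask` is made up for illustration.

```python
def cpuset_mask(cpus):
    """Build an integer bitmask with bit i set for every CPU id i in cpus.

    Assumption: the 'cpuset' printed by likwid-pin is this kind of mask.
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


print(hex(cpuset_mask([0, 1])))  # 0x3, matches the two-thread run
print(hex(cpuset_mask([0])))     # 0x1, matches the single-thread run
```

Under that reading, the CPU list was parsed as intended in both runs, so the failure happens after the CPU string has been evaluated.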

In contrast, with a single thread I get:

...
Evaluated CPU string to CPUs: 0
[likwid-pin] Main PID -> hwthread 0 - OK
Running: ./albm
Using 1 thread(s) (cpuset: 0x1)
 num_steps =         1000
 tau / dt ratio =   2.0000000E-02
 CFL  =   0.6270693    
 U0   =   1.1547005E-02
 Mach =   2.0000000E-02
 Re   =    1000.000    
 Everything okay
       51486     1081185
 In assembly routine:
    n   =        51485
    nnz =      1081185
    rownnz_max =           21
    rhs_max =            9
 Attempting to allocate memory
 n =        51485 , nz =           21 , q =            9
 sysclock (s)    3.43853497505188     
 mlups    14.9729455041758     
 ompwtime (s)    3.43853306770325     
 mlups    14.9729538096440     
 Total time (s)   3.43853306770325     
 Collision time ratio   1.559326410374301E-002
 Streaming time ratio   0.984065665745844     

If I run the application directly, it works as expected:

(base) ivan@maxwell:~/lrz/rbfxlbm/build$ OMP_NUM_THREADS=2 ./albm
 num_steps =         1000
 tau / dt ratio =   2.0000000E-02
 CFL  =   0.6270693    
 U0   =   1.1547005E-02
 Mach =   2.0000000E-02
 Re   =    1000.000    
 Everything okay
       51486     1081185
 In assembly routine:
    n   =        51485
    nnz =      1081185
    rownnz_max =           21
    rhs_max =            9
 Attempting to allocate memory
 n =        51485 , nz =           21 , q =            9
 sysclock (s)    1.81032705307007     
 mlups    28.4396107920625     
 ompwtime (s)    1.81032490730286     
 mlups    28.4396445013620     
 Total time (s)   1.81032490730286     
 Collision time ratio   1.993282543925349E-002
 Streaming time ratio   0.979440022346742     

Thanks for reporting. I have never seen such behavior.

Does it work with other applications and multiple threads? Are you using some computing library like TBB, Cilk+, SYCL, ...? If it is OpenMP, is it one of the common implementations (GCC, LLVM, Intel)?

No response? I will close the issue soon.

I was only testing GCC and Intel compilers. Potentially TBB via MKL Sparse BLAS, but I'd need to double check this. I'll try again with a simpler application.
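
A simpler multi-threaded test could be as small as the sketch below, which starts a few threads and has each one report the set of CPUs it is allowed to run on. It is only an illustration: it uses Python threads rather than OpenMP, it is Linux-only because it relies on `os.sched_getaffinity`, and the names `report_affinity` and `run_threads` are hypothetical. It makes no claim about how likwid-pin treats such a program.

```python
import os
import threading


def report_affinity(results, index):
    # On Linux, the underlying call with pid 0 reports the affinity mask
    # of the calling thread, so each thread records its own allowed CPUs.
    results[index] = sorted(os.sched_getaffinity(0))


def run_threads(num_threads=2):
    """Start num_threads threads and collect each thread's allowed CPU list."""
    results = [None] * num_threads
    threads = [
        threading.Thread(target=report_affinity, args=(results, i))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


for i, cpus in enumerate(run_threads(2)):
    print(f"thread {i}: allowed CPUs {cpus}")
```

If a program like this prints its lines when started directly but not when started through the pinning wrapper, that would point at the wrapper rather than at the original application.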

Thanks for your response. If you used OpenMP (GCC or Intel), we should try to find the error. My question regarding threading solutions like TBB, Cilk+ or SYCL was just to ensure we are not talking about something exotic.

Can you please try the following with your failing code:

# Rebuild LIKWID with DEBUG=true
$ cd likwid-src
$ make distclean
$ make PREFIX=$LIKWID_INSTALL_DIR DEBUG=true
$ make PREFIX=$LIKWID_INSTALL_DIR DEBUG=true install
$ gdb $LIKWID_INSTALL_DIR/bin/likwid-lua
gdb > run $LIKWID_INSTALL_DIR/bin/likwid-pin -V 3 -c 0,1 ./albm
<fails>
gdb > backtrace

With this, I should be able to locate the exact error.