openucx/ucx

Assertion `worker->inprogress++ == 0' failed

Opened this issue · 3 comments

Describe the bug

I compiled the code on my laptop, where it runs without problems. After porting it to a server, however, I intermittently hit the error below. It does not happen on every run, and I have not been able to determine what triggers it.

[gs07r1b29:3935050:2:3935480] ucp_worker.c:2990 Assertion `worker->inprogress++ == 0' failed
backtrace (tid:3935480) ====
0 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucs.so.0.0.0(ucs_handle_error+0x3f4) [0x7f5584b05704]
1 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucs.so.0.0.0(ucs_fatal_error_message+0xec) [0x7f5584b02b9c]
2 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucs.so.0.0.0(ucs_fatal_error_format+0x103) [0x7f5584b02aa3]
3 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucp.so.0.0.0(ucp_worker_progress+0x1a3) [0x7f5552cfb433]
4 [0x7f5515415e5b]

[gs07r1b29:3935050:1:3935478] ucp_worker.c:2995 Assertion `--worker->inprogress == 0' failed
backtrace (tid:3935478) ====
0 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucs.so.0.0.0(ucs_handle_error+0x3f4) [0x7f5584b05704]
1 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucs.so.0.0.0(ucs_fatal_error_message+0xec) [0x7f5584b02b9c]
2 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucs.so.0.0.0(ucs_fatal_error_format+0x103) [0x7f5584b02aa3]
3 /gpfs/apps/MN5/GPP/UCX/1.16.0/INTEL/lib/libucp.so.0.0.0(ucp_worker_progress+0xd3) [0x7f5552cfb363]
4 [0x7f5515415e5b]

Steps to Reproduce

Run an application that uses send stream / receive stream through jucx and follows a structure similar to the UCXBenchmark.

Setup and versions

  • Linux, CPU architecture x86_64
  • Environment variables used:
    export UCX_TLS=ud_mlx5
    export UCX_NET_DEVICES=mlx5_2:1

This looks like an issue with multi-threading support not being enabled. If the application is multi-threaded, UCX has to be compiled with multi-thread support (--enable-mt), and ucp_worker_create has to be called with ucp_worker_params_t::thread_mode = UCS_THREAD_MODE_MULTI.
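To illustrate why the two assertions in the backtraces point at multi-threading, here is a minimal sketch in plain Java. It does not use UCX or jucx; the Worker class, its inProgress field, and the helper methods are hypothetical stand-ins that only mimic the shape of the `worker->inprogress++ == 0` check. The assumption is that the check is a guard against two callers being inside ucp_worker_progress at once, which matches the two different thread IDs in the report.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

class InProgressGuardDemo {

    /** Hypothetical stand-in for a worker that assumes a single caller. */
    static final class Worker {
        private int inProgress = 0; // plain field: no lock, no atomics

        void progress(Runnable body) {
            // Entry check, analogous to: assert(worker->inprogress++ == 0)
            if (inProgress++ != 0) {
                inProgress--; // undo so the demo can finish cleanly
                throw new IllegalStateException("progress() entered concurrently");
            }
            try {
                body.run();
            } finally {
                inProgress--; // exit side, analogous to --worker->inprogress
            }
        }
    }

    static void awaitLatch(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if a second thread entering progress() trips the guard. */
    static boolean detectsConcurrentProgress() {
        Worker worker = new Worker();
        CountDownLatch firstInside = new CountDownLatch(1);
        CountDownLatch secondDone = new CountDownLatch(1);
        AtomicBoolean tripped = new AtomicBoolean(false);

        Thread first = new Thread(() -> worker.progress(() -> {
            firstInside.countDown(); // first thread is now inside progress()
            awaitLatch(secondDone);  // stay inside until the second thread has tried
        }));
        Thread second = new Thread(() -> {
            awaitLatch(firstInside);
            try {
                worker.progress(() -> { });
            } catch (IllegalStateException e) {
                tripped.set(true);
            } finally {
                secondDone.countDown();
            }
        });

        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return tripped.get();
    }

    /** Back-to-back calls from one thread never trip the guard. */
    static boolean sequentialCallsPass() {
        Worker worker = new Worker();
        worker.progress(() -> { });
        worker.progress(() -> { });
        return worker.inProgress == 0;
    }

    public static void main(String[] args) {
        System.out.println("concurrent entry detected: " + detectsConcurrentProgress());
        System.out.println("sequential calls pass: " + sequentialCallsPass());
    }
}
```

The latches force the overlap that happens only occasionally in a real run, which would also explain why the failure in the report is intermittent: it needs two threads to be inside the progress call at the same moment.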

I am using jucx, the Java binding. How should I make that call in this case?

On Mon, Aug 5, 2024 at 9:17 AM Yossi Itigin wrote:
> Seems an issue with enabling multi-threading support. If the application is multi-threaded, UCX has to be compiled with multi-thread support (--enable-mt) and ucp_worker_create has to be called with ucp_worker_params_t::thread_mode= UCS_THREAD_MODE_MULTI

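See how requestThreadSafety is used in UcpWorkerTest.java: https://github.com/openucx/ucx/blob/master/bindings/java/src/test/java/org/openucx/jucx/UcpWorkerTest.java#L41. In jucx, thread safety is requested on the worker parameters before the worker is created, e.g. new UcpWorkerParams().requestThreadSafety().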