
Legion + UCX network: slower compared to GASNet


Running Legion with UCX results in a significant slowdown: for our test case, UCX is about 2x slower than GASNet+ibv.

We run our test on multiple CPU-only nodes, each with 36 cores (2 sockets) and a ConnectX-4 network card. The 2x slowdown is also observed on single-node runs. We have also tried different UCX configurations (e.g. with xpmem configured manually).

Below is an example of a UCX configuration that we have tested:

#define UCX_CONFIGURE_FLAGS       "--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-optimizations  --with-verbs --with-mlx5-dv --enable-mt"

#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: sysv
#      Transport: posix
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: cma
#      Transport: knem
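In case it matters, this is roughly how we configure and build UCX with those flags (a sketch; the install prefix is a placeholder for our actual path):

./autogen.sh    # only needed when building from a git checkout
./configure --prefix=$HOME/ucx-install \
            --disable-logging --disable-debug --disable-assertions \
            --disable-params-check --enable-optimizations \
            --with-verbs --with-mlx5-dv --enable-mt
make -j
make install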

Are we missing some UCX configuration details?

Do you use -ll:bgworkpin 1? If not, can you please rerun with that option?
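For reference, it is just a command-line flag passed to the application, so assuming an MPI-style launcher and a placeholder binary name, the rerun would look something like:

mpirun -np 2 ./your_app -ll:bgworkpin 1    # ./your_app is a placeholder for your test binary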

I have just re-run with -ll:bgworkpin 1 and obtained similar results.

What UCX version are you using? Please share the output of ucx_info -v.

UCX 1.15.0

Let's get the output with -level ucp=2 and also UCX logs by setting UCX_LOG_LEVEL=debug UCX_LOG_FILE=<some_path>/ucx_log.%h.%p UCX_PROTO_INFO=y.
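Assuming the same kind of launch as before (the binary name and log path are placeholders), that would be something like:

export UCX_LOG_LEVEL=debug
export UCX_LOG_FILE=/tmp/ucx_log.%h.%p    # %h/%p should expand to hostname/PID
export UCX_PROTO_INFO=y
mpirun -np 2 ./your_app -level ucp=2      # make sure your launcher forwards these variables to all ranks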

Here are the logs for a small single-node run:

ucp:

[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: bootstrapped UCP network module
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: UCX_ZCOPY_THRESH modified to 2048 for context 0x1ff2f30
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: UCX_IB_SEG_SIZE modified to 8192 for context 0x1ff2f30
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: initialized ucp context 0x1ff2f30 max_am_header 3945
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: initialized 1 ucp contexts
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: total num_eps 1
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: attached segments
[1 - 14f28eef5dc0]    0.002592 {4}{threads}: reservation ('utility proc 1d00010000000000') cannot be satisfied
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: bootstrapped UCP network module
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: UCX_ZCOPY_THRESH modified to 2048 for context 0x1b6d480
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: UCX_IB_SEG_SIZE modified to 8192 for context 0x1b6d480
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: initialized ucp context 0x1b6d480 max_am_header 3945
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: initialized 1 ucp contexts
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: total num_eps 1
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: attached segments
[0 - 14b82d78fdc0]    0.002654 {4}{threads}: reservation ('utility proc 1d00000000000000') cannot be satisfied
...
[1 - 14f28eef5dc0]  419.778561 {2}{ucp}: detaching segments
[1 - 14f28eef5dc0]  419.778599 {2}{ucp}: ended ucp pollers
[1 - 14f28eef5dc0]  419.870804 {2}{ucp}: unmapped ucp-mapped memory
[1 - 14f28eef5dc0]  420.408270 {2}{ucp}: finalized ucp contexts
[1 - 14f28eef5dc0]  420.411369 {2}{ucp}: finalized ucp bootstrap
[0 - 14b82d78fdc0]  419.753321 {2}{ucp}: detaching segments
[0 - 14b82d78fdc0]  419.753354 {2}{ucp}: ended ucp pollers
[0 - 14b82d78fdc0]  419.840665 {2}{ucp}: unmapped ucp-mapped memory
[0 - 14b82d78fdc0]  420.294596 {2}{ucp}: finalized ucp contexts
[0 - 14b82d78fdc0]  420.297354 {2}{ucp}: finalized ucp bootstrap

UCX:
ucx_log_1.log

What version of Legion are you using? It seems like you're using a relatively old one.

It is an older version, corresponding to commit 45afa8e658ae06cb19d8f0374de699b7fe4a197c.

Do you believe a newer Legion version would improve the performance when running with UCX?

Yes, let's test with the latest Legion (or at least something after 13d4101) and then take it from there.
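If it helps, a rough sketch of pulling a newer Legion and rebuilding with the UCX network layer (I'm assuming the CMake build and the Legion_NETWORKS=ucx option; the install prefix is a placeholder, so adjust for your setup):

git clone https://github.com/StanfordLegion/legion.git
cd legion
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
      -DLegion_NETWORKS=ucx \
      -DCMAKE_INSTALL_PREFIX=$HOME/legion-install
cmake --build build -j
cmake --install build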

With the latest Legion I obtained better performance. However, UCX is still slower on our test case (around 12% slower).

@SeyedMir Is there something else that I could test (e.g. a specific UCX configuration)?

@SeyedMir do you have any other suggestions to improve the Legion+UCX performance?

Hard to say without profiling. Is this test/code something you can share with me so I can take a look?
Also, can you get UCX logs again and this time also set UCX_PROTO_ENABLE=y UCX_PROTO_INFO=y?

Let me re-run and obtain the logs.

Our test case is available on GitHub: https://github.com/flecsi/flecsi/tree/2/tutorial/standalone/poisson
However, it is not implemented directly in Legion; it is implemented using FleCSI, which runs on top of Legion.

Here are the logs for a run on two nodes:
ucx_log_0.log
ucx_log_1.log

By hand-tuning our runs (and using the new Legion release) I was able to obtain better results with UCX on a single node (around 15% better than GASNet). However, when I try to run on multiple nodes I get the following error:

[cn355:11558:0:11678] ib_mlx5_log.c:171  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[cn355:11558:0:11678] ib_mlx5_log.c:171  RC QP 0x1cc0 wqe[1069]: SEND --e [inl len 84] [rqpn 0x1040 dlid=88 sl=0 port=1 src_path_bits=0]
[cn355:11561:0:11682] ib_mlx5_log.c:171  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[cn355:11561:0:11682] ib_mlx5_log.c:171  RC QP 0x1cd8 wqe[12485]: SEND --e [inl len 84] [rqpn 0x15f0 dlid=97 sl=0 port=1 src_path_bits=0]

It looks like there are too many requests and InfiniBand is not able to handle them. Can I change the UCX configuration to avoid this error?

That signals an issue in the network. For some reason, packets are being dropped and the underlying network transport (RC in this case) reaches the maximum retry count and gives up. This is not a UCX or application issue. You can set UCX_RC_RETRY_COUNT to a higher value (the default is 7, I believe) and see if that helps, though a healthy network should not really need that.
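e.g. something like this (the value is just an example; make sure your launcher forwards the variable to all ranks, and the binary name is a placeholder):

export UCX_RC_RETRY_COUNT=20
mpirun -np 2 ./your_app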

I'm curious what tuning helped you get better results.

I will contact our cluster administrator to see if they can help. I think 7 is the maximum value we can set for UCX_RC_RETRY_COUNT; I am getting the following warning:

[1713985050.849256] [cn337:8455 :0]        rc_iface.c:526  UCX  WARN  using maximal value for RETRY_COUNT (7) instead of 20

I'm curious what tuning helped you get better results.

Previously we were running with multiple colors per MPI process (launching multiple tasks, which potentially requires more communication). Now we run with multiple threads per MPI process (usually one MPI process per socket), and each process launches OpenMP kernels.
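On our 2-socket, 36-core nodes this means one rank per socket with 18 OpenMP threads each; with Open MPI the single-node launch looks roughly like this (the binary name is a placeholder, and the exact binding flags depend on the launcher):

export OMP_NUM_THREADS=18
mpirun -np 2 --map-by socket --bind-to socket ./poisson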

We also increased the problem size for our tests and used the new Legion release (24.03.00).