
Legion + UCX network: slower compared to GASNet


Running Legion with UCX results in a significant slowdown: for our test case, UCX is about 2x slower than GASNet+ibv.

We run our test on multiple CPU-only nodes, each with 36 cores (2 sockets) and a ConnectX-4 network card. The 2x slowdown is also observed on single-node runs. We have also tried different UCX configurations (e.g. with xpmem configured manually).

Below is an example of a UCX configuration that we have tested:

#define UCX_CONFIGURE_FLAGS       "--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-optimizations  --with-verbs --with-mlx5-dv --enable-mt"

#      Transport: self
#      Transport: tcp
#      Transport: tcp
#      Transport: tcp
#      Transport: sysv
#      Transport: posix
#      Transport: dc_mlx5
#      Transport: rc_verbs
#      Transport: rc_mlx5
#      Transport: ud_verbs
#      Transport: ud_mlx5
#      Transport: cma
#      Transport: knem
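In case it matters, this is roughly how we configure and build UCX with those flags (a sketch; the install prefix is a placeholder for our actual path):

./autogen.sh    # only needed when building from a git checkout
./configure --prefix=$HOME/ucx-install \
            --disable-logging --disable-debug --disable-assertions \
            --disable-params-check --enable-optimizations \
            --with-verbs --with-mlx5-dv --enable-mt
make -j
make install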

Are we missing some UCX configuration details?

Do you use -ll:bgworkpin 1? If not, can you please rerun with that option?
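For reference, it is just a command-line flag passed to the application, so assuming an MPI-style launcher and a placeholder binary name, the rerun would look something like:

mpirun -np 2 ./your_app -ll:bgworkpin 1    # ./your_app is a placeholder for your test binary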

I have just re-run with -ll:bgworkpin 1 and obtained similar results.

What UCX version are you using? Please share the output of ucx_info -v.

UCX 1.15.0

Let's get the output with -level ucp=2 and also UCX logs by setting UCX_LOG_LEVEL=debug UCX_LOG_FILE=<some_path>/ucx_log.%h.%p UCX_PROTO_INFO=y.
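Assuming the same kind of launch as before (the binary name and log path are placeholders), that would be something like:

export UCX_LOG_LEVEL=debug
export UCX_LOG_FILE=/tmp/ucx_log.%h.%p    # %h/%p should expand to hostname/PID
export UCX_PROTO_INFO=y
mpirun -np 2 ./your_app -level ucp=2      # make sure your launcher forwards these variables to all ranks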

Here are the logs for a small single-node run:

ucp:

[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: bootstrapped UCP network module
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: UCX_ZCOPY_THRESH modified to 2048 for context 0x1ff2f30
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: UCX_IB_SEG_SIZE modified to 8192 for context 0x1ff2f30
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: initialized ucp context 0x1ff2f30 max_am_header 3945
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: initialized 1 ucp contexts
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: total num_eps 1
[1 - 14f28eef5dc0]    0.000000 {2}{ucp}: attached segments
[1 - 14f28eef5dc0]    0.002592 {4}{threads}: reservation ('utility proc 1d00010000000000') cannot be satisfied
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: bootstrapped UCP network module
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: UCX_ZCOPY_THRESH modified to 2048 for context 0x1b6d480
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: UCX_IB_SEG_SIZE modified to 8192 for context 0x1b6d480
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: initialized ucp context 0x1b6d480 max_am_header 3945
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: initialized 1 ucp contexts
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: total num_eps 1
[0 - 14b82d78fdc0]    0.000000 {2}{ucp}: attached segments
[0 - 14b82d78fdc0]    0.002654 {4}{threads}: reservation ('utility proc 1d00000000000000') cannot be satisfied
...
[1 - 14f28eef5dc0]  419.778561 {2}{ucp}: detaching segments
[1 - 14f28eef5dc0]  419.778599 {2}{ucp}: ended ucp pollers
[1 - 14f28eef5dc0]  419.870804 {2}{ucp}: unmapped ucp-mapped memory
[1 - 14f28eef5dc0]  420.408270 {2}{ucp}: finalized ucp contexts
[1 - 14f28eef5dc0]  420.411369 {2}{ucp}: finalized ucp bootstrap
[0 - 14b82d78fdc0]  419.753321 {2}{ucp}: detaching segments
[0 - 14b82d78fdc0]  419.753354 {2}{ucp}: ended ucp pollers
[0 - 14b82d78fdc0]  419.840665 {2}{ucp}: unmapped ucp-mapped memory
[0 - 14b82d78fdc0]  420.294596 {2}{ucp}: finalized ucp contexts
[0 - 14b82d78fdc0]  420.297354 {2}{ucp}: finalized ucp bootstrap

UCX:
ucx_log_1.log

What version of Legion are you using? It seems like you're using a relatively old one.

It is an older version, corresponding to commit 45afa8e658ae06cb19d8f0374de699b7fe4a197c.

Do you believe a newer Legion version would improve the performance when running with UCX?

Yes, let's test with the latest Legion (or at least something after 13d4101) and then take it from there.
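If it helps, a rough sketch of pulling a newer Legion and rebuilding with the UCX network layer (I'm assuming the CMake build and the Legion_NETWORKS=ucx option; the install prefix is a placeholder, so adjust for your setup):

git clone https://github.com/StanfordLegion/legion.git
cd legion
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
      -DLegion_NETWORKS=ucx \
      -DCMAKE_INSTALL_PREFIX=$HOME/legion-install
cmake --build build -j
cmake --install build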

With the latest Legion I obtained better performance. However, UCX is still slower on our test case (around 12% slower).

@SeyedMir Is there something else that I could test (e.g. a specific UCX configuration)?

@SeyedMir do you have any other suggestions to improve the Legion+UCX performance?

Hard to say without profiling. Is this test/code something you can share with me so I can take a look?
Also, can you get UCX logs again and this time also set UCX_PROTO_ENABLE=y UCX_PROTO_INFO=y?

Let me re-run and obtain the logs.

Our test case is available on GitHub: https://github.com/flecsi/flecsi/tree/2/tutorial/standalone/poisson
However, it is not implemented directly in Legion; it is implemented using FleCSI, which runs on top of Legion.

Here are the logs for a run on two nodes:
ucx_log_0.log
ucx_log_1.log

By hand-tuning our runs (and using the new Legion release) I was able to obtain better results with UCX on a single node (around 15% better than GASNet). However, when I try to run on multiple nodes I get the following error:

[cn355:11558:0:11678] ib_mlx5_log.c:171  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[cn355:11558:0:11678] ib_mlx5_log.c:171  RC QP 0x1cc0 wqe[1069]: SEND --e [inl len 84] [rqpn 0x1040 dlid=88 sl=0 port=1 src_path_bits=0]
[cn355:11561:0:11682] ib_mlx5_log.c:171  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[cn355:11561:0:11682] ib_mlx5_log.c:171  RC QP 0x1cd8 wqe[12485]: SEND --e [inl len 84] [rqpn 0x15f0 dlid=97 sl=0 port=1 src_path_bits=0]

It looks like there are too many requests and InfiniBand is not able to handle them. Can I change the UCX configuration to avoid this error?

That signals an issue in the network. For some reason, packets are being dropped and the underlying network transport (RC in this case) reaches the maximum retry count and gives up. This is not a UCX or application issue. You can set UCX_RC_RETRY_COUNT to a higher value (the default is 7, I believe) and see if that helps, though a healthy network should not really need that.
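e.g. something like this (the value is just an example; make sure your launcher forwards the variable to all ranks, and the binary name is a placeholder):

export UCX_RC_RETRY_COUNT=20
mpirun -np 2 ./your_app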

I'm curious what tuning helped you get better results.

I will contact our cluster administrator to see if they can help. I think 7 is the maximum value we can set for UCX_RC_RETRY_COUNT; I am getting the following warning:

[1713985050.849256] [cn337:8455 :0]        rc_iface.c:526  UCX  WARN  using maximal value for RETRY_COUNT (7) instead of 20

I'm curious what tuning helped you get better results.

Previously we were running with multiple colors per MPI process (launching multiple tasks, which potentially requires more communication). Now we run with multiple threads per MPI process (usually one MPI process per socket), and each process launches OpenMP kernels.
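On our 2-socket, 36-core nodes this means one rank per socket with 18 OpenMP threads each; with Open MPI the single-node launch looks roughly like this (the binary name is a placeholder, and the exact binding flags depend on the launcher):

export OMP_NUM_THREADS=18
mpirun -np 2 --map-by socket --bind-to socket ./poisson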

We also increased the problem size for our tests and used the new Legion release (24.03.00).