[core][experimental] Higher than expected overhead for shared memory channels with NCCL

Question


Opened this issue a month ago · 2 comments

What happened + What you expected to happen

Microbenchmark results for a single-actor accelerated DAG show about 30k calls/s, or about 30us/call. That is consistent with other channel-performance microbenchmarks that @jackhumphries ran, which show low tens of us per channel op.

However, a microbenchmark for the recently added NCCL transport shows about 5.8k calls/s for NCCL alone and 3.2k calls/s for DAG+NCCL. That is about 170us/call for NCCL alone versus about 310us/call for DAG+NCCL, so the DAG adds roughly 130-140us per call, more than 4x the ~30us/call expected.
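For reference, the arithmetic behind these figures can be checked with a short script. This is only a sketch using the rounded throughput numbers quoted in this issue; the helper name `per_call_us` is made up for illustration, and attributing the difference between the two NCCL runs to the DAG is an assumption about how the overhead was derived.

```python
def per_call_us(calls_per_second: float) -> float:
    """Convert a throughput in calls/s to a mean latency in microseconds per call."""
    return 1e6 / calls_per_second


# Rounded figures quoted in this issue, so the results are approximate.
dag_only = per_call_us(30_000)      # single-actor accelerated DAG, no NCCL
nccl_only = per_call_us(5_800)      # NCCL transport alone
dag_plus_nccl = per_call_us(3_200)  # DAG + NCCL

# Assumption: the extra time in the DAG+NCCL run is the DAG's contribution.
dag_overhead = dag_plus_nccl - nccl_only
ratio = dag_overhead / dag_only

print(f"DAG only:       {dag_only:.0f} us/call")
print(f"NCCL only:      {nccl_only:.0f} us/call")
print(f"DAG + NCCL:     {dag_plus_nccl:.0f} us/call")
print(f"DAG overhead:   {dag_overhead:.0f} us/call")
print(f"vs. DAG only:   {ratio:.1f}x")
```

With the rounded inputs the overhead comes out slightly above the 130us quoted, and the ratio to the DAG-only latency is a little over 4x.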

Versions / Dependencies

3.0dev

Reproduction script

See linked microbenchmarks.
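The linked microbenchmarks are not reproduced here. As a rough illustration of how a calls/s number of this kind is typically measured, here is a generic timing harness. It is not the benchmark code from the repository: `measure_calls_per_second` and `noop` are hypothetical names, and in a real run `fn` would be a callable that executes the DAG once and waits for its result.

```python
import time
from typing import Callable


def measure_calls_per_second(
    fn: Callable[[], object],
    num_calls: int = 10_000,
    warmup: int = 100,
) -> float:
    """Time `num_calls` back-to-back invocations of `fn` and return calls/s."""
    # Warm up first so one-time setup cost is not counted.
    for _ in range(warmup):
        fn()

    start = time.perf_counter()
    for _ in range(num_calls):
        fn()
    elapsed = time.perf_counter() - start
    return num_calls / elapsed


def noop() -> None:
    """Hypothetical stand-in for one DAG execution."""
    return None


rate = measure_calls_per_second(noop)
print(f"{rate:.0f} calls/s, {1e6 / rate:.2f} us/call")
```

Running the same harness against the DAG-only, NCCL-only, and DAG+NCCL paths would give directly comparable per-call numbers.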

Issue Severity

None

Answer 1 · 2024-05-23T22:37:50.000Z

Discussed during standup today. The goal is to bring the NCCL performance to within 50% of expected, versus the current 4x. This may or may not affect vLLM performance, but if it does, it should bring that closer to expected as well.

Answer 2 · 2024-06-10T22:35:59.000Z

Environment is set up; about to dive into debugging the overhead.