Why is the effective bandwidth of each NVLink 20 GB/s instead of 25 GB/s?

Question

Why is the effective bandwidth of each NVLink 20 GB/s instead of 25 GB/s?

gabbychen opened this issue 6 months ago · 2 comments

Hi, I want to check the communication performance of an A100 server for model training.
I found that the bandwidth per link is below 20 GB/s instead of 25 GB/s.
AFAIK the bandwidth efficiency can reach nearly 90% when the payload is 128 B (packet header: 16 B).
(See the paper "Scalable Irregular Parallelism with GPUs: Getting CPUs Out of the Way".)
With this payload the effective bandwidth should reach about 22 GB/s.
Is there other overhead in the packet that reduces the effective bandwidth to 20 GB/s?
And can I change the software configuration (or maybe the hardware) or the source code to improve the effective bandwidth?
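
For reference, the estimate in the question can be reproduced with a short calculation. This is only a sketch of the simple header-plus-payload model described above; the function and constant names are made up for illustration, and the model assumes the 16 B header is the only per-packet overhead.

```python
def payload_efficiency(payload_bytes: int, header_bytes: int) -> float:
    """Fraction of the bytes on the wire that carry payload."""
    return payload_bytes / (payload_bytes + header_bytes)


RAW_LINK_GBPS = 25.0  # nominal per-direction link rate quoted in the question
HEADER_BYTES = 16     # packet header size quoted in the question
PAYLOAD_BYTES = 128   # payload size quoted in the question

eff = payload_efficiency(PAYLOAD_BYTES, HEADER_BYTES)
effective = RAW_LINK_GBPS * eff

# prints: efficiency = 88.9%, effective = 22.2 GB/s
print(f"efficiency = {eff:.1%}, effective = {effective:.1f} GB/s")
```

The result (about 22 GB/s) matches the figure in the question, so the gap down to 20 GB/s has to come from something this model does not include.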

Answer 1 · 2024-10-24T17:50:57.000Z

Your assumption only holds for unidirectional traffic. Bidirectional traffic incurs extra overhead. NCCL almost always uses both directions of each link (except perhaps for a 2-rank broadcast), hence the extra overhead.
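
To put a rough number on that extra overhead, the same header-plus-payload model can be inverted: given the measured rate, how many overhead bytes per 128 B payload would explain it? This is a back-of-envelope sketch only. It says nothing about how NVLink actually encodes bidirectional traffic, and the function name is hypothetical.

```python
def implied_overhead_bytes(payload_bytes: float,
                           raw_gbps: float,
                           measured_gbps: float) -> float:
    """Per-packet overhead (in bytes) that would reduce raw_gbps to measured_gbps.

    Derived from: measured / raw = payload / (payload + overhead).
    """
    return payload_bytes * (raw_gbps / measured_gbps - 1.0)


# Numbers taken from the question: 25 GB/s nominal, 20 GB/s observed, 128 B payload.
overhead = implied_overhead_bytes(128, 25.0, 20.0)
print(f"implied overhead per 128 B payload: {overhead:.0f} B")  # prints 32 B
```

Under this model, 20 GB/s corresponds to 80% efficiency, or about 32 B of overhead per 128 B payload, which is 16 B more than the header alone.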

Answer 2 · 2024-11-04T07:49:14.000Z

Hi sjeaugey,

Thanks for your feedback.
I've tried the P2P Connectivity Matrix command,
which I think generates unidirectional traffic,
but the result is still less than 20 GB/s per link (less than 240 GB/s in total).

Could you explain the reason?
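
For anyone comparing numbers, the per-link figure above is just the aggregate GPU-to-GPU bandwidth divided by the link count. The sketch below assumes 12 NVLink links per A100, which is what 240 GB/s at 20 GB/s per link implies. The measured value is a hypothetical placeholder, not a result from this thread.

```python
NUM_LINKS = 12        # assumption: 12 NVLink links per A100 (240 GB/s / 20 GB/s per link)
RAW_LINK_GBPS = 25.0  # nominal per-direction rate per link, from the question


def per_link_gbps(aggregate_gbps: float, num_links: int = NUM_LINKS) -> float:
    """Average per-link bandwidth given the aggregate unidirectional bandwidth."""
    return aggregate_gbps / num_links


measured_aggregate = 230.0  # hypothetical measurement in GB/s; substitute your own
per_link = per_link_gbps(measured_aggregate)
print(f"per link: {per_link:.1f} GB/s "
      f"({per_link / RAW_LINK_GBPS:.0%} of the nominal {RAW_LINK_GBPS:.0f} GB/s)")
```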