Why is the effective B/W of each NVLink 20GB/s instead of 25GB/s?
gabbychen opened this issue · 2 comments
Hi, I want to check the communication performance of an A100 server for model training.
I found that the B/W per link is <20GB/s instead of 25GB/s.
AFAIK the B/W efficiency can reach nearly 90% when the payload reaches 128B (packet header: 16B).
(Reference paper: "Scalable Irregular Parallelism with GPUs: Getting CPUs Out of the Way".)
With this payload the effective B/W can reach about 22GB/s.
May I know if there is other overhead in the packet that brings the effective B/W down to 20GB/s?
And is it possible to change the S/W configuration (maybe H/W?) or the source code so that the effective B/W can be improved?
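For reference, the arithmetic behind the 22GB/s estimate above can be sketched as follows. This is only a simple model that assumes one 16B header per packet and no other protocol overhead; the function name and the list of payload sizes are illustrative, not taken from NCCL or the paper.

```python
# Simple packet-efficiency model: every packet carries a fixed-size header,
# so the useful fraction of the link is payload / (payload + header).
# Assumption: no overhead other than the 16B header quoted in the post.

RAW_LINK_GBPS = 25.0   # raw per-link, per-direction bandwidth from the post
HEADER_BYTES = 16      # packet header size from the post


def payload_efficiency(payload_bytes: int, header_bytes: int = HEADER_BYTES) -> float:
    """Fraction of the transferred bytes that is payload."""
    return payload_bytes / (payload_bytes + header_bytes)


# Illustrative payload sizes; only 128B is mentioned in the post.
for payload in (32, 64, 128, 256):
    eff = payload_efficiency(payload)
    print(f"{payload:4d}B payload: {eff:.1%} efficient, "
          f"{RAW_LINK_GBPS * eff:.1f} GB/s effective")
```

With a 128B payload this model gives 128/144, about 88.9%, or roughly 22.2GB/s, which matches the figures in the question. The measured <20GB/s therefore implies overhead beyond the packet header.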
Your assumption is only true for unidirectional traffic. For bidirectional traffic you have extra overhead. NCCL almost always uses both directions of each link (except maybe on a 2-rank broadcast), hence the extra overhead.
Hi sjeaugey,
Thanks for your feedback.
I've run the P2P Connectivity Matrix command,
which I think generates unidirectional traffic,
but the result is still less than 20GB/s per link (less than 240GB/s in total).
Could you explain the reason?
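The per-link number above comes from dividing the aggregate figure by the link count. A minimal sketch, assuming 12 NVLink links per GPU (the count implied by 240GB/s at 20GB/s per link) and a raw 25GB/s per link; the helper name is illustrative:

```python
# Convert an aggregate unidirectional P2P bandwidth into a per-link figure
# and compare it with the raw link rate.
# Assumptions: 12 links per GPU and 25 GB/s raw per link, per direction.

NUM_LINKS = 12
RAW_LINK_GBPS = 25.0


def per_link_stats(aggregate_gbps: float,
                   num_links: int = NUM_LINKS,
                   raw_link_gbps: float = RAW_LINK_GBPS) -> tuple[float, float]:
    """Return (per-link GB/s, fraction of the raw link rate achieved)."""
    per_link = aggregate_gbps / num_links
    return per_link, per_link / raw_link_gbps


per_link, efficiency = per_link_stats(240.0)
print(f"{per_link:.1f} GB/s per link, {efficiency:.0%} of raw link rate")
```

An aggregate of 240GB/s corresponds to 20GB/s per link, or 80% of the raw rate, so any measurement below 240GB/s is below 80% efficiency under these assumptions.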