Does it support link-aggregated ethernet NICs?
jinserk opened this issue · 5 comments
Hi,
I have two nodes, each with 2 ethernet NICs bonded with link aggregation, but they appear to use only a single NIC's bandwidth when I use MPI or NCCL with PyTorch. I wonder how Gloo behaves here: does it support multiple connections between two nodes when they have bonded NICs? So let me ask whether the current Gloo supports multiple NICs and link aggregation. If not, do you have any plan to support it?
Thanks!
Thanks for the question!
This is not handled by Gloo itself, as there is only a single connection per communicating pair per context. We aim to solve this from the PyTorch side (c10d) by supporting multiple ProcessGroup instances per process. There is no global state, so you can create as many independent process groups as you like. Then you can maximize performance by, for example, using them in aggregate or round robin (provided you have multiple operations running concurrently). The current code in PyTorch still only uses a single Gloo context per process group, but this is the direction we're going in. In some preliminary benchmarks I have found this to work very well, especially on machines with very fast network cards. I expect the same will hold for bonded NICs, where each connection should end up pinned to one of the underlying NICs. Multiple connections allow all NICs to be used and will therefore hopefully maximize performance for you.
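(For illustration only, here is a minimal sketch of what the round-robin idea could look like from the torch.distributed side once multiple process groups are usable this way. The group count, helper names, and tensor list are made up for the example, and it assumes the usual MASTER_ADDR/MASTER_PORT environment-based rendezvous.)

import torch.distributed as dist

NUM_GROUPS = 4  # illustrative; tune for your NIC setup

def init_groups(rank, world_size):
    # Standard env:// rendezvous; MASTER_ADDR and MASTER_PORT must be set.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # Each new_group() call creates an independent process group, and hence
    # an independent Gloo context with its own TCP connections.
    return [dist.new_group(backend="gloo") for _ in range(NUM_GROUPS)]

def all_reduce_round_robin(tensors, groups):
    # Kick off the reductions asynchronously, assigning tensors to groups
    # round robin, then wait for all of them to finish.
    handles = [
        dist.all_reduce(t, group=groups[i % len(groups)], async_op=True)
        for i, t in enumerate(tensors)
    ]
    for h in handles:
        h.wait()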
How are you planning on using this? E.g. through the distributed data parallel wrappers or through torch.distributed
directly?
Hi @pietern ! Thank you for your kind and detailed reply! It would be really cool if this gets implemented.
As you said, I'd like to use it with PyTorch's DDP if it supports such a configuration. But if I need to, I can use torch.distributed directly.
No problem! This will be supported through DDP for sure, so no need to do anything custom.
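(For reference, a minimal sketch of DDP over the Gloo backend, which is the path being discussed here. The model, sizes, and training loop are placeholders, and it again assumes MASTER_ADDR/MASTER_PORT are set in the environment.)

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main(rank, world_size):
    # Gloo backend over TCP.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(1024, 1024)  # placeholder model
    ddp_model = DDP(model)               # DDP all-reduces gradients for you

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    for _ in range(10):
        opt.zero_grad()
        loss = ddp_model(torch.randn(32, 1024)).sum()
        loss.backward()                  # gradient all-reduce overlaps with backward
        opt.step()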
What does your bonded NIC setup look like? What NIC model and link speed?
Here are the ethernet NICs:
01:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
01:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
and the bond interface's network script:
DEVICE=bond0
ONBOOT=yes
NETMASK=255.255.255.0
IPADDR=xxx.xxx.xxx.xxx
GATEWAY=xxx.xxx.xxx.xxx
TYPE=bond
BONDING_OPTS="mode=6 miimon=1000 updelay=5000"
BONDING_MASTER=yes
and the individual (slave) interface configuration is:
# Generated by dracut initrd
NAME="eno1"
DEVICE="eno1"
ONBOOT=yes
NETBOOT=yes
UUID="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
IPV6INIT=yes
#BOOTPROTO=dhcp
TYPE=Ethernet
SLAVE=yes
MASTER=bond0
The NICs themselves are not super fast; with two 1 Gbit/s ports bonded, the aggregate link speed is only around 2 Gbit/s.
Sorry for the delayed response.
If I understand correctly, NIC bonding load balances at the level of IP connections. If you run a job with 2 machines, there will only be a single connection between the two, and it will be handled by one of the NICs that make up the bond. You'll need multiple connections between the machines to make use of the aggregate bandwidth.
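(As a quick sanity check of that claim, here is a rough sketch of a multi-connection throughput test one could run between the two nodes; the peer address, port, and connection count are placeholders, and the other node needs some data sink listening on that port, e.g. a netcat that discards input. This is not part of Gloo or PyTorch, just a way to see whether parallel TCP connections actually spread across both NICs of the bond.)

import socket
import threading
import time

PEER = "xxx.xxx.xxx.xxx"   # the other node's bond0 address (placeholder)
PORT = 5201                # any port with a data sink listening on the peer
NUM_CONNS = 4              # more connections -> better chance of using both NICs
CHUNK = b"\0" * (1 << 20)  # 1 MiB payload
DURATION = 10.0            # seconds to send on each connection

def blast(results, idx):
    # Open one TCP connection and push data for DURATION seconds.
    sent = 0
    with socket.create_connection((PEER, PORT)) as s:
        deadline = time.time() + DURATION
        while time.time() < deadline:
            s.sendall(CHUNK)
            sent += len(CHUNK)
    results[idx] = sent

results = [0] * NUM_CONNS
threads = [threading.Thread(target=blast, args=(results, i)) for i in range(NUM_CONNS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

gbit = sum(results) * 8 / DURATION / 1e9
print(f"aggregate throughput over {NUM_CONNS} connections: {gbit:.2f} Gbit/s")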
Last week I submitted a PR to PyTorch to do this by creating multiple Gloo contexts. If you're unlucky you may still have two connections land on the same NIC, so you may need to create a few more before you start to see improved bandwidth. See pytorch/pytorch#22978 for the code.
I'm closing the issue in favor of an enhancement issue I created at #190.