msr-fiddle/pipedream

GPU Peer2Peer communication via --num_ranks_in_server argument


I would like to better understand the --num_ranks_in_server argument in the image classification runtime referenced here. I have the following questions:

  1. Should it be equal to the number of GPUs per node?
  2. The runtime README does not mention this argument in its instructions for running the framework. From my understanding, setting it correctly is important for enabling peer-to-peer communication among GPUs using gloo. Otherwise, PipeDream falls back to a slower communication path that first copies the data to the CPU and then sends it via gloo. Please let me know whether I should be using this argument to run the framework in its optimal configuration. If so, I'd suggest updating the README.
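To make the question concrete, here is a minimal sketch of how I assume the value is used. It is not taken from the PipeDream source: the helper names and the path labels are hypothetical, and it assumes ranks are numbered contiguously within each node, so that integer division by --num_ranks_in_server gives the node index.

```python
# Hypothetical sketch, not PipeDream code: illustrates the assumed role of
# --num_ranks_in_server when deciding how two ranks exchange tensors.

def server_index(rank: int, num_ranks_in_server: int) -> int:
    """Index of the node hosting `rank`, assuming contiguous rank numbering."""
    return rank // num_ranks_in_server


def same_server(rank_a: int, rank_b: int, num_ranks_in_server: int) -> bool:
    """True if both ranks are assumed to live on the same node."""
    return (server_index(rank_a, num_ranks_in_server)
            == server_index(rank_b, num_ranks_in_server))


def transfer_path(src: int, dst: int, num_ranks_in_server: int) -> str:
    """Label the communication path this sketch assumes would be chosen."""
    if same_server(src, dst, num_ranks_in_server):
        return "gpu_peer_to_peer"    # tensor stays on the GPU
    return "staged_through_cpu"      # tensor is copied to the CPU first


if __name__ == "__main__":
    # 8 ranks in total, 4 GPUs per node (example values only).
    for src, dst in [(0, 3), (3, 4), (4, 7)]:
        print(src, dst, transfer_path(src, dst, num_ranks_in_server=4))
```

Under this assumption, a value of 1 would make every pair of ranks look like it sits on different nodes, so all transfers would take the CPU path even when the GPUs share a machine. That is why I am asking whether the value should match the number of GPUs per node.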

I wanted to follow up on this. We are using PipeDream in our experiments and want to be sure we are running the framework with the optimal settings.