BIGBALLON/distribuuuu

Bro, what's the difference between `rank` and `local_rank`?

buaacarzp opened this issue · 2 comments

rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

  • Reference:
  1. https://pytorch.org/tutorials/beginner/dist_overview.html
  2. https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md
  3. https://zhuanlan.zhihu.com/p/360405558
  • Explanation

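Assume you have 4 machines with 32 GPUs in total (8 GPUs per machine) and run one process per GPU:

  • world_size is 32 (the total number of processes)
  • rank is 0 ~ 31 (the global index of a process across all machines)
  • local_rank is 0 ~ 7 (the index of a process on its own machine); since each machine has only 8 GPUs, use local_rank as the device number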
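To make the numbers above concrete, here is a minimal sketch of how the two ranks relate. It assumes one process per GPU and that ranks are assigned node by node, which is the usual layout but not guaranteed by every launcher. The helper name `global_rank` and the `node_rank` argument are made up for this illustration.

```python
# Sketch only: assumes ranks are laid out node by node, one process per GPU.
NUM_NODES = 4       # machines in the example above
GPUS_PER_NODE = 8   # GPUs (and processes) per machine


def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Hypothetical helper: global rank of a process from its node and local index."""
    return node_rank * gpus_per_node + local_rank


world_size = NUM_NODES * GPUS_PER_NODE

# Enumerate every (node, local_rank, rank) triple in the example cluster.
layout = [
    (node, local, global_rank(node, local, GPUS_PER_NODE))
    for node in range(NUM_NODES)
    for local in range(GPUS_PER_NODE)
]

for node, local, rank in layout:
    # Print only the first and last process of each machine to keep output short.
    if local in (0, GPUS_PER_NODE - 1):
        print(f"node={node} local_rank={local} rank={rank}")
```

Under these assumptions, the last GPU on the last machine has local_rank 7 and rank 31, while local_rank 0 appears once on every machine.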

net = DDP(net, device_ids=[local_rank], output_device=local_rank)
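The `local_rank` passed to DDP above is normally read from the environment that the launcher (for example `torchrun`) sets for each process. The following is a small sketch of that step; the helper `read_dist_env` and the single-process fallback values are assumptions for illustration, not part of this repository or of PyTorch.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int        # global index of this process, 0 .. world_size - 1
    local_rank: int  # index of this process on its own machine
    world_size: int  # total number of processes


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Hypothetical helper: parse RANK / LOCAL_RANK / WORLD_SIZE.

    Falls back to single-process values when the variables are missing
    (an assumption made here so the script also runs without a launcher).
    """
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


# Example: the second GPU on the second machine of the 4 x 8 setup.
example = read_dist_env({"RANK": "9", "LOCAL_RANK": "1", "WORLD_SIZE": "32"})
print(example)
```

The `local_rank` field is the value you would use as the device number, while `rank` is what you would check for things that should happen once per job, such as logging or saving checkpoints on rank 0.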

I suggest you read dist_overview first.

Good job, I like you very much!