pytorch/examples

`local_rank` or `rank` for multi-node FSDP

Emerald01 opened this issue · 0 comments

For multi-node FSDP, do `local_rank` and `rank` have any meaningful difference here?
My understanding is that `local_rank` is the process rank within a single node, while `rank` is the global rank across all nodes.

I see that in a few places `local_rank` is used specifically.

For example:

https://github.com/pytorch/examples/blob/main/distributed/FSDP/T5_training.py#L111
torch.cuda.set_device(local_rank)

and
https://github.com/pytorch/examples/blob/main/distributed/FSDP/utils/train_utils.py#L48
batch[key] = batch[key].to(local_rank)

Is there any problem with using `rank` instead?
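
For context, here is a minimal sketch of why the two values diverge once there is more than one node. It assumes a launch via `torchrun`, which sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables; the 2-node / 8-GPU layout in the comments is just an illustrative assumption.

```python
import os
import torch
import torch.distributed as dist

# Example layout with 2 nodes x 8 GPUs (assumed for illustration):
#   node 0 -> rank 0..7,  local_rank 0..7
#   node 1 -> rank 8..15, local_rank 0..7
rank = int(os.environ["RANK"])              # global rank across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # rank within this node

dist.init_process_group(backend="nccl")

# set_device expects a GPU index on *this* node, so it takes local_rank.
# Passing the global rank on node 1 would point at cuda:8..cuda:15,
# which do not exist on an 8-GPU machine.
torch.cuda.set_device(local_rank)

# .to(local_rank) is interpreted the same way: as a CUDA device index
# on the current node.
x = torch.ones(4).to(local_rank)
```

On a single node the two values happen to coincide, which is why swapping `rank` in may appear to work until the job is scaled to multiple nodes.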