`local_rank` or `rank` for multi-node FSDP
Emerald01 opened this issue · 0 comments
Emerald01 commented
I am wondering: for multi-node FSDP, do `local_rank` and `rank` have any obvious difference here? I think I understand that `local_rank` is the rank within a node.
I see a few places where `local_rank` is specifically used. For example:

https://github.com/pytorch/examples/blob/main/distributed/FSDP/T5_training.py#L111

`torch.cuda.set_device(local_rank)`

and

https://github.com/pytorch/examples/blob/main/distributed/FSDP/utils/train_utils.py#L48

`batch[key] = batch[key].to(local_rank)`
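
For context on where these two values typically come from, here is a minimal setup sketch, assuming a `torchrun` launch that sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables (the `setup` helper name is my own, not from the linked example):

```python
import os
import torch
import torch.distributed as dist

def setup():
    rank = int(os.environ["RANK"])              # global rank across all nodes
    local_rank = int(os.environ["LOCAL_RANK"])  # rank within this node
    world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

    # GPU indices restart at 0 on every node, so the CUDA device index
    # must come from local_rank, not from the global rank.
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    return rank, local_rank
```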
Is there any problem with using `rank` instead?
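
To make the concern concrete, here is a hypothetical worked example (my own numbers, not from the linked code), assuming `torchrun`'s standard layout where `rank = node_rank * nproc_per_node + local_rank`:

```python
# Hypothetical job: 2 nodes, 8 GPUs per node, 16 processes total.
node_rank, nproc_per_node, local_rank = 1, 8, 1

# Global rank under torchrun's standard layout.
rank = node_rank * nproc_per_node + local_rank
assert rank == 9

# CUDA device indices restart at 0 on every node, so on node 1:
#   batch[key].to(local_rank) targets cuda:1 (valid),
#   batch[key].to(rank)       targets cuda:9 (no such GPU on an 8-GPU node).
```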