multi-node training error: Invalid device id (on the slave machine, not the main machine)
Closed this issue · 7 comments
I followed your distributed.md. My test setup: 2 machines on the same network, each with 8 NVIDIA 4090 GPUs.
My accelerate config on the slave machine:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  zero3_init_flag: false
  zero_stage: 1
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 1
main_process_ip: 192.168.0.207
main_process_port: 8080
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
dynamo_config:
  # Update this from NO to INDUCTOR
  dynamo_backend: INDUCTOR
  dynamo_mode: max-autotune
  dynamo_use_dynamic: false
  dynamo_use_fullgraph: false
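For reference, a minimal pre-flight check (not part of SimpleTuner; it only assumes torch is installed, and the two counts mirror the config above) to confirm the global process count matches what each node actually enumerates; run it on every machine before launching:

```python
# Hypothetical sanity check, not SimpleTuner code: verify that the
# accelerate config's global process count matches this node's GPUs.
import torch

num_machines = 2    # from the config above
num_processes = 16  # from the config above

local_gpus = torch.cuda.device_count()
print(f"this node enumerates {local_gpus} GPU(s)")

# With 2 machines x 8 GPUs each, every node should report 8; a node
# that reports 7 has a GPU missing from enumeration.
assert num_processes == num_machines * local_gpus, (
    "num_processes does not equal num_machines * locally visible GPUs"
)
```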
The error:
175ff800 count 1 datatype 7 op 0 root 0 comm 0x55d483af0880 [nranks=16] stream 0x55d483af0730
192-168-1-86:395110:395110 [6] NCCL INFO AllReduce: opCount 9 sendbuff 0x7fe1375ff800 recvbuff 0x7fe1375ff800 count 1 datatype 7 op 0 root 0 comm 0x55c52129f0b0 [nranks=16] stream 0x55c52129ef60
192-168-1-86:395111:395111 [7] NCCL INFO AllReduce: opCount a sendbuff 0x7f0a2f5ffa00 recvbuff 0x7f0a2f5ffa00 count 1 datatype 7 op 0 root 0 comm 0x5592ecd257c0 [nranks=16] stream 0x5592ecd25670
192-168-1-86:395106:395106 [2] NCCL INFO AllReduce: opCount a sendbuff 0x7f37335ffa00 recvbuff 0x7f37335ffa00 count 1 datatype 7 op 0 root 0 comm 0x55f16ead52e0 [nranks=16] stream 0x55f16ead5190
192-168-1-86:395107:395107 [3] NCCL INFO AllReduce: opCount a sendbuff 0x7efe4b5ffa00 recvbuff 0x7efe4b5ffa00 count 1 datatype 7 op 0 root 0 comm 0x564dbc4379f0 [nranks=16] stream 0x564dbc4378a0
192-168-1-86:395108:395108 [4] NCCL INFO AllReduce: opCount a sendbuff 0x7f7ccb5ffa00 recvbuff 0x7f7ccb5ffa00 count 1 datatype 7 op 0 root 0 comm 0x564eab0ad5d0 [nranks=16] stream 0x564eab0ad480
192-168-1-86:395109:395109 [5] NCCL INFO AllReduce: opCount a sendbuff 0x7fd9df5ffa00 recvbuff 0x7fd9df5ffa00 count 1 datatype 7 op 0 root 0 comm 0x564904454e90 [nranks=16] stream 0x564904454d40
192-168-1-86:395104:395104 [0] NCCL INFO AllReduce: opCount a sendbuff 0x7fa9fb5ffa00 recvbuff 0x7fa9fb5ffa00 count 1 datatype 7 op 0 root 0 comm 0x5611268d02a0 [nranks=16] stream 0x5611268d0150
192-168-1-86:395110:395110 [6] NCCL INFO AllReduce: opCount a sendbuff 0x7fe1375ffa00 recvbuff 0x7fe1375ffa00 count 1 datatype 7 op 0 root 0 comm 0x55c52129f0b0 [nranks=16] stream 0x55c52129ef60
192-168-1-86:395105:395105 [1] NCCL INFO AllReduce: opCount a sendbuff 0x7f92175ffa00 recvbuff 0x7f92175ffa00 count 1 datatype 7 op 0 root 0 comm 0x55d483af0880 [nranks=16] stream 0x55d483af0730
192-168-1-86:395105:395105 [1] NCCL INFO AllReduce: opCount b sendbuff 0x7f92175ff800 recvbuff 0x7f92175ff800 count 1 datatype 7 op 0 root 0 comm 0x55d483af0880 [nranks=16] stream 0x55d483af0730
192-168-1-86:395108:395108 [4] NCCL INFO AllReduce: opCount b sendbuff 0x7f7ccb5ff800 recvbuff 0x7f7ccb5ff800 count 1 datatype 7 op 0 root 0 comm 0x564eab0ad5d0 [nranks=16] stream 0x564eab0ad480
192-168-1-86:395111:395111 [7] NCCL INFO AllReduce: opCount b sendbuff 0x7f0a2f5ff800 recvbuff 0x7f0a2f5ff800 count 1 datatype 7 op 0 root 0 comm 0x5592ecd257c0 [nranks=16] stream 0x5592ecd25670
192-168-1-86:395109:395109 [5] NCCL INFO AllReduce: opCount b sendbuff 0x7fd9df5ff800 recvbuff 0x7fd9df5ff800 count 1 datatype 7 op 0 root 0 comm 0x564904454e90 [nranks=16] stream 0x564904454d40
192-168-1-86:395106:395106 [2] NCCL INFO AllReduce: opCount b sendbuff 0x7f37335ff800 recvbuff 0x7f37335ff800 count 1 datatype 7 op 0 root 0 comm 0x55f16ead52e0 [nranks=16] stream 0x55f16ead5190
192-168-1-86:395110:395110 [6] NCCL INFO AllReduce: opCount b sendbuff 0x7fe1375ff800 recvbuff 0x7fe1375ff800 count 1 datatype 7 op 0 root 0 comm 0x55c52129f0b0 [nranks=16] stream 0x55c52129ef60
192-168-1-86:395104:395104 [0] NCCL INFO AllReduce: opCount b sendbuff 0x7fa9fb5ff800 recvbuff 0x7fa9fb5ff800 count 1 datatype 7 op 0 root 0 comm 0x5611268d02a0 [nranks=16] stream 0x5611268d0150
192-168-1-86:395107:395107 [3] NCCL INFO AllReduce: opCount b sendbuff 0x7efe4b5ff800 recvbuff 0x7efe4b5ff800 count 1 datatype 7 op 0 root 0 comm 0x564dbc4379f0 [nranks=16] stream 0x564dbc4378a0
192-168-1-86:395105:395105 [1] NCCL INFO AllReduce: opCount c sendbuff 0x7f92175ffa00 recvbuff 0x7f92175ffa00 count 1 datatype 7 op 0 root 0 comm 0x55d483af0880 [nranks=16] stream 0x55d483af0730
192-168-1-86:395108:395108 [4] NCCL INFO AllReduce: opCount c sendbuff 0x7f7ccb5ffa00 recvbuff 0x7f7ccb5ffa00 count 1 datatype 7 op 0 root 0 comm 0x564eab0ad5d0 [nranks=16] stream 0x564eab0ad480
192-168-1-86:395106:395106 [2] NCCL INFO AllReduce: opCount c sendbuff 0x7f37335ffa00 recvbuff 0x7f37335ffa00 count 1 datatype 7 op 0 root 0 comm 0x55f16ead52e0 [nranks=16] stream 0x55f16ead5190
192-168-1-86:395111:395111 [7] NCCL INFO AllReduce: opCount c sendbuff 0x7f0a2f5ffa00 recvbuff 0x7f0a2f5ffa00 count 1 datatype 7 op 0 root 0 comm 0x5592ecd257c0 [nranks=16] stream 0x5592ecd25670
192-168-1-86:395110:395110 [6] NCCL INFO AllReduce: opCount c sendbuff 0x7fe1375ffa00 recvbuff 0x7fe1375ffa00 count 1 datatype 7 op 0 root 0 comm 0x55c52129f0b0 [nranks=16] stream 0x55c52129ef60
192-168-1-86:395109:395109 [5] NCCL INFO AllReduce: opCount c sendbuff 0x7fd9df5ffa00 recvbuff 0x7fd9df5ffa00 count 1 datatype 7 op 0 root 0 comm 0x564904454e90 [nranks=16] stream 0x564904454d40
192-168-1-86:395107:395107 [3] NCCL INFO AllReduce: opCount c sendbuff 0x7efe4b5ffa00 recvbuff 0x7efe4b5ffa00 count 1 datatype 7 op 0 root 0 comm 0x564dbc4379f0 [nranks=16] stream 0x564dbc4378a0
192-168-1-86:395104:395104 [0] NCCL INFO AllReduce: opCount c sendbuff 0x7fa9fb5ffa00 recvbuff 0x7fa9fb5ffa00 count 1 datatype 7 op 0 root 0 comm 0x5611268d02a0 [nranks=16] stream 0x5611268d0150
Invalid device id
Traceback (most recent call last):
  File "/home/ubuntu/SimpleTuner/train.py", line 37, in <module>
    trainer.init_load_base_model()
  File "/home/ubuntu/SimpleTuner/helpers/training/trainer.py", line 613, in init_load_base_model
    self.unet, self.transformer = load_diffusion_model(
  File "/home/ubuntu/SimpleTuner/helpers/training/diffusion_model.py", line 91, in load_diffusion_model
    transformer = FluxTransformer2DModel.from_pretrained(
  File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 821, in from_pretrained
    model = cls.from_config(config, **unused_kwargs)
  File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 260, in from_config
    model = cls(**init_dict)
  File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 665, in inner_init
    init(self, *args, **init_kwargs)
  File "/home/ubuntu/SimpleTuner/helpers/models/flux/transformer.py", line 463, in __init__
    [
  File "/home/ubuntu/SimpleTuner/helpers/models/flux/transformer.py", line 464, in <listcomp>
    FluxTransformerBlock(
  File "/home/ubuntu/SimpleTuner/helpers/models/flux/transformer.py", line 305, in __init__
    primary_device = torch.cuda.get_device_properties(rank)
  File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 526, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id
Invalid device id
Traceback (most recent call last):
Regarding get_device_properties(rank): is this rank wrong? It seems it should be the slave machine's local rank.
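For context, a minimal sketch of the suspected mismatch, assuming the standard RANK/LOCAL_RANK environment variables set by the accelerate/torchrun launcher (none of this is SimpleTuner code):

```python
import os
import torch

# Set by the launcher; the defaults are only for running this standalone.
global_rank = int(os.environ.get("RANK", 0))       # 0..15 across both machines
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # 0..7 on each machine

# CUDA device ids are per-node: only 0 .. device_count()-1 are valid here.
print("visible GPUs on this node:", torch.cuda.device_count())

# On the second machine global_rank is 8..15, so this would raise
# AssertionError("Invalid device id"):
#   torch.cuda.get_device_properties(global_rank)

# The in-range index is the local rank:
print(torch.cuda.get_device_properties(local_rank).name)
```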
Did you follow the distributed.md document?
Never mind, I missed that.
But I ran into this too; in my case one of the systems had 7 GPUs enumerated instead of 8. One of the GPUs was bad.
Are you saying one of my GPUs is bad, or that there is a problem with the GPU ID allocation?
The GPU ID allocation happens at a higher level, inside PyTorch or somewhere; I'm really not too sure. But if it's having a problem there, there is little that can be done on this side of things.
Do you want to reproduce and solve this problem? I'll print out the ID numbers tomorrow to observe them.
I am pretty confident that any solution to this issue would not require a change to the trainer's source code, since it is used on H100 multi-node training clusters.
Thank you very much! When I change get_device_properties(rank) to get_device_properties(rank % num_of_GPUs_per_device), the pretrained model loads correctly. So I worry that this change will cause another problem.
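A sketch of that workaround, with the helper name invented for illustration; it assumes every node exposes the same number of GPUs:

```python
import torch

def local_device_index(global_rank: int) -> int:
    # Hypothetical helper illustrating the modulo workaround; a node with
    # a failed or hidden GPU would still end up with a wrong mapping.
    return global_rank % torch.cuda.device_count()

# e.g. on a 2 x 8-GPU setup, global rank 9 maps to local device 1:
#   torch.cuda.get_device_properties(local_device_index(9))
```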
I've put in a fix after considering for a bit that H100 systems only ever have H100s in them; we can just check device 0.
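Roughly, a sketch of what that approach amounts to (an assumption about the patch, not the committed code): since every GPU in a node is the same model, device 0's properties stand in for the whole node.

```python
import torch

# All GPUs in the node are assumed identical (all 4090s or all H100s),
# so querying device 0 is enough to learn the hardware properties.
primary_device = torch.cuda.get_device_properties(0)
print(primary_device.name, primary_device.total_memory)
```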