Akegarasu/lora-scripts

Multi-GPU training on Windows fails with system error: 10049

CCJetWing opened this issue · 3 comments

Single-GPU training runs fine with no problems.
After changing to $multi_gpu = 2, this warning appears (the Chinese Windows message for error 10049 translates to "The requested address is not valid in its context."):
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-G8GCIRB]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
Then the progress bar just hangs. After waiting 30 minutes, this appears:
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=1, worker_count=2, timeout=0:30:00)
How should I handle this?
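
Error 10049 is the Winsock WSAEADDRNOTAVAIL code: the c10d client socket cannot bind or connect to whatever address [DESKTOP-G8GCIRB]:29500 resolves to on this machine. A common workaround is to point the rendezvous at the loopback interface instead of the hostname. Below is a minimal sketch of that idea, assuming the standard torch.distributed environment-variable rendezvous; it is not the actual lora-scripts launch path:

```python
# Minimal sketch: force the c10d rendezvous onto the loopback interface so
# the client socket never tries the machine's hostname. MASTER_ADDR and
# MASTER_PORT are the standard torch.distributed env vars; "gloo" matches
# the backend shown in the traceback below.
import os
import torch.distributed as dist

os.environ["MASTER_ADDR"] = "127.0.0.1"  # instead of DESKTOP-G8GCIRB
os.environ["MASTER_PORT"] = "29500"

# single-process smoke test of the rendezvous
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print("rank", dist.get_rank(), "of", dist.get_world_size())
dist.destroy_process_group()
```

On the accelerate side, the equivalent launcher knobs are --main_process_ip and --main_process_port.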

21:32:25-432945 INFO Torch backend: nVidia CUDA 11.8 cuDNN 8700
21:32:25-443295 INFO Torch detected GPU: NVIDIA GeForce RTX 3090 VRAM 24576 Arch (8, 6) Cores 82
21:32:25-452047 INFO Torch detected GPU: NVIDIA GeForce RTX 3090 VRAM 24576 Arch (8, 6) Cores 82
21:37:01-728473 INFO Training started with config file / 训练开始,使用配置文件:
D:\lora-scripts\config\autosave\20240311-213701.toml
21:37:01-750403 INFO Using GPU(s) / 使用 GPU: ['0', '1']
21:37:01-760003 INFO Task 3aba0559-0d37-441a-9c14-b72fc2420e3f created
NOTE: Redirects are currently not supported in Windows or MacOs.
Using RTX 3090 or 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-G8GCIRB]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-G8GCIRB]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
Loading settings from D:\lora-scripts\config\autosave\20240311-213701.toml...
Loading settings from D:\lora-scripts\config\autosave\20240311-213701.toml...
D:\lora-scripts\config\autosave\20240311-213701
D:\lora-scripts\config\autosave\20240311-213701
prepare tokenizer
prepare tokenizer
update token length: 255
Using DreamBooth method.
update token length: 255
Using DreamBooth method.
prepare images.
found directory Z:\data\LoRA_training\19\H\15_H contains 10 image files
150 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
batch_size: 4
resolution: (512, 512)
enable_bucket: True
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: False

[Subset 0 of Dataset 0]
image_dir: "Z:\data\LoRA_training\19\H\15_H"
image_count: 10
num_repeats: 15
shuffle_caption: True
keep_tokens: 0
keep_tokens_separator:
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: H
caption_extension: .txt

[Dataset 0]
loading image sizes.
prepare images.
0%| | 0/10 [00:00<?, ?it/s]
found directory Z:\data\LoRA_training\19\H\15_H contains 10 image files
150 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
batch_size: 4
resolution: (512, 512)
enable_bucket: True
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: False

[Subset 0 of Dataset 0]
image_dir: "Z:\data\LoRA_training\19\H\15_H"
image_count: 10
num_repeats: 15
shuffle_caption: True
keep_tokens: 0
keep_tokens_separator:
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: H
caption_extension: .txt

[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 41.62it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 54.12it/s]
make buckets
make buckets
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (512, 512), count: 150
bucket 0: resolution (512, 512), count: 150
mean ar error (without repeats): 0.0
mean ar error (without repeats): 0.0
preparing accelerator
preparing accelerator
loading model for process 0/2
load StableDiffusion checkpoint: D:/sd-webui-aki-v4/models/Stable-diffusion/epicrealism_naturalSin.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net:
loading vae:
loading text encoder:
loading model for process 1/2
load StableDiffusion checkpoint: D:/sd-webui-aki-v4/models/Stable-diffusion/epicrealism_naturalSin.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net:
loading vae:
loading text encoder:
Enable xformers for U-Net
Enable xformers for U-Net
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 8707.29it/s]
[Dataset 0]
caching latents.
checking cache validity...
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 3763.73it/s]
caching latents...
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:02<00:00, 3.45it/s]
create LoRA network. base dim (rank): 128, alpha: 1
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder:
create LoRA for Text Encoder: 72 modules.
create LoRA network. base dim (rank): 128, alpha: 1
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder:
create LoRA for Text Encoder: 72 modules.
create LoRA for U-Net: 192 modules.
enable LoRA for text encoder
enable LoRA for U-Net
CrossAttnDownBlock2D False -> True
CrossAttnDownBlock2D False -> True
CrossAttnDownBlock2D False -> True
DownBlock2D False -> True
UNetMidBlock2DCrossAttn False -> True
UpBlock2D False -> True
CrossAttnUpBlock2D False -> True
CrossAttnUpBlock2D False -> True
CrossAttnUpBlock2D False -> True
use Adafactor optimizer | {'relative_step': True}
relative_step is true / relative_stepがtrueです
learning rate is used as initial_lr / 指定したlearning rateはinitial_lrとして使用されます
unet_lr and text_encoder_lr are ignored / unet_lrとtext_encoder_lrは無視されます
use adafactor_scheduler / スケジューラにadafactor_schedulerを使用します
create LoRA for U-Net: 192 modules.
enable LoRA for text encoder
enable LoRA for U-Net
CrossAttnDownBlock2D False -> True
CrossAttnDownBlock2D False -> True
CrossAttnDownBlock2D False -> True
DownBlock2D False -> True
UNetMidBlock2DCrossAttn False -> True
UpBlock2D False -> True
CrossAttnUpBlock2D False -> True
CrossAttnUpBlock2D False -> True
CrossAttnUpBlock2D False -> True
prepare optimizer, data loader etc.
use Adafactor optimizer | {'relative_step': True}
relative_step is true / relative_stepがtrueです
learning rate is used as initial_lr / 指定したlearning rateはinitial_lrとして使用されます
unet_lr and text_encoder_lr are ignored / unet_lrとtext_encoder_lrは無視されます
use adafactor_scheduler / スケジューラにadafactor_schedulerを使用します
override steps. steps for 20 epochs is / 指定エポックまでのステップ数: 380
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 150
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 19
num epochs / epoch数: 20
batch size per device / バッチサイズ: 4
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 380
steps: 0%| | 0/380 [00:00<?, ?it/s]
epoch 1/20
rank:1, local_rank:1, world_size:2
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-G8GCIRB]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-G8GCIRB]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
rank:0, local_rank:0, world_size:2
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-G8GCIRB]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-G8GCIRB]:29500 (system error: 10049 - 在其上下文中,该请求的地址无效。).
Traceback (most recent call last):
File "", line 1, in
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 125, in _main
prepare(preparation_data)
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\lora-scripts\sd-scripts\train_network.py", line 56, in
torch.distributed.init_process_group(backend="gloo", world_size=world_size, rank=rank)
File "D:\lora-scripts\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 932, in init_process_group
_store_based_barrier(rank, store, timeout)
File "D:\lora-scripts\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 469, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
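
Note what this traceback shows: the failing init_process_group call at train_network.py line 56 is executing inside multiprocessing's spawn_main / _fixup_main_from_path, i.e. a spawned child process (such as a DataLoader worker) has re-imported the training script and re-run the module-level distributed init. That matches worker_count=4 against world_size=2: each real rank plus one phantom registration. A hedged sketch of the guard that avoids this, not the actual sd-scripts code:

```python
# Hedged sketch, not the actual sd-scripts code. On Windows, multiprocessing
# uses spawn, which re-imports the main script in every child it creates
# (the _fixup_main_from_path frames above). A module-level
# init_process_group therefore runs again inside DataLoader workers and
# registers phantom ranks with the rendezvous store -- hence
# worker_count=4 for world_size=2. Under a __main__ guard the init is
# skipped in children, which re-run with __name__ == "__mp_main__".
import torch.distributed as dist

def main():
    if not dist.is_initialized():
        # env:// rendezvous: RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
        # come from the accelerate / torchrun launcher's environment.
        dist.init_process_group(backend="gloo")
    # ... training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```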

I'm asking here precisely because I already made the changes from #308 and still ran into this problem.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 3221225786) local_rank: 0 (pid: 70072) of binary: D:\lora-scripts\venv\Scripts\python.exe
Traceback (most recent call last):
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\lora-scripts\venv\lib\site-packages\accelerate\commands\launch.py", line 1027, in
main()
File "D:\lora-scripts\venv\lib\site-packages\accelerate\commands\launch.py", line 1023, in main
launch_command(args)
File "D:\lora-scripts\venv\lib\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "D:\lora-scripts\venv\lib\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "D:\lora-scripts\venv\lib\site-packages\torch\distributed\run.py", line 785, in run
elastic_launch(
File "D:\lora-scripts\venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "D:\lora-scripts\venv\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./sd-scripts/train_network.py FAILED

Failures:
[1]:
time : 2024-03-12_19:24:25
host : DESKTOP-G8GCIRB
rank : 1 (local_rank: 1)
exitcode : 3221225786 (pid: 39764)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-03-12_19:24:25
host : DESKTOP-G8GCIRB
rank : 0 (local_rank: 0)
exitcode : 3221225786 (pid: 70072)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

19:24:26-883399 ERROR Training failed / 训练失败