
Multi-GPU training on Windows fails with system error: 10049

After changing the setting to $multi_gpu = 2, the following error appears:
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-G8GCIRB]:29500 (system error: 10049 - The requested address is not valid in its context.).
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=1, worker_count=2, timeout=0:30:00)
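
Error 10049 (WSAEADDRNOTAVAIL) on the connect to [DESKTOP-G8GCIRB]:29500 suggests that the machine name may resolve to an address that is not usable on this host. The following is a minimal diagnostic sketch, not part of lora-scripts or PyTorch: the helper name `check_rendezvous_address` is made up for illustration, and it only checks name resolution and whether each resolved IPv4 address can be bound locally. It uses port 0 for the bind test so it does not collide with anything already listening on 29500.

```python
import socket


def check_rendezvous_address(host: str, port: int = 29500) -> dict:
    """Resolve `host` and test whether each resolved IPv4 address is bindable locally.

    Hypothetical helper for diagnosis only. `port` is used for name resolution;
    the bind test uses port 0 so the OS picks any free port.
    """
    result = {"host": host, "resolved": [], "bindable": []}
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        # The name does not resolve at all.
        return result

    for family, socktype, proto, _canonname, sockaddr in infos:
        addr = sockaddr[0]
        if addr in result["resolved"]:
            continue
        result["resolved"].append(addr)
        with socket.socket(family, socktype, proto) as sock:
            try:
                sock.bind((addr, 0))
                result["bindable"].append(addr)
            except OSError:
                # The address is not assigned to a local interface.
                pass
    return result


if __name__ == "__main__":
    # Compare the machine name against the loopback address.
    print(check_rendezvous_address(socket.gethostname()))
    print(check_rendezvous_address("127.0.0.1"))
```

If the machine name resolves to nothing, or to an address that cannot be bound, while 127.0.0.1 works, then pointing the rendezvous at the loopback address (for example through the `MASTER_ADDR` environment variable that torch.distributed reads) may be worth trying. This is an assumption about the cause and has not been verified on this setup.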

21:32:25-432945 INFO Torch backend: nVidia CUDA 11.8 cuDNN 8700
21:32:25-443295 INFO Torch detected GPU: NVIDIA GeForce RTX 3090 VRAM 24576 Arch (8, 6) Cores 82
21:32:25-452047 INFO Torch detected GPU: NVIDIA GeForce RTX 3090 VRAM 24576 Arch (8, 6) Cores 82
21:37:01-728473 INFO Training started with config file / 训练开始,使用配置文件:
21:37:01-750403 INFO Using GPU(s) / 使用 GPU: ['0', '1']
21:37:01-760003 INFO Task 3aba0559-0d37-441a-9c14-b72fc2420e3f created
NOTE: Redirects are currently not supported in Windows or MacOs.
Using RTX 3090 or 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
Loading settings from D:\lora-scripts\config\autosave\20240311-213701.toml...
prepare tokenizer
update token length: 255
Using DreamBooth method.
prepare images.
found directory Z:\data\LoRA_training\19\H\15_H contains 10 image files
150 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
batch_size: 4
resolution: (512, 512)
enable_bucket: True
min_bucket_reso: 256
max_bucket_reso: 1024
bucket_reso_steps: 64
bucket_no_upscale: False

[Subset 0 of Dataset 0]
image_dir: "Z:\data\LoRA_training\19\H\15_H"
image_count: 10
num_repeats: 15
shuffle_caption: True
keep_tokens: 0
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
is_reg: False
class_tokens: H
caption_extension: .txt

[Dataset 0]
loading image sizes.
[Dataset 0]
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 41.62it/s]
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (512, 512), count: 150
mean ar error (without repeats): 0.0

preparing accelerator
loading model for process 0/2
load StableDiffusion checkpoint: D:/sd-webui-aki-v4/models/Stable-diffusion/epicrealism_naturalSin.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net:
loading vae:
loading text encoder:
import network module: networks.lora
[Dataset 0]
caching latents.
checking cache validity...
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 8707.29it/s]
caching latents...
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:02<00:00, 3.45it/s]
create LoRA network. base dim (rank): 128, alpha: 1
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder:
create LoRA for Text Encoder: 72 modules.
create LoRA for U-Net: 192 modules.
enable LoRA for text encoder
enable LoRA for U-Net
CrossAttnDownBlock2D False -> True
CrossAttnDownBlock2D False -> True
CrossAttnDownBlock2D False -> True
DownBlock2D False -> True
UNetMidBlock2DCrossAttn False -> True
UpBlock2D False -> True
CrossAttnUpBlock2D False -> True
CrossAttnUpBlock2D False -> True
CrossAttnUpBlock2D False -> True
use Adafactor optimizer | {'relative_step': True}
relative_step is true / relative_stepがtrueです
learning rate is used as initial_lr / 指定したlearning rateはinitial_lrとして使用されます
unet_lr and text_encoder_lr are ignored / unet_lrとtext_encoder_lrは無視されます
use adafactor_scheduler / スケジューラにadafactor_schedulerを使用します
override steps. steps for 20 epochs is / 指定エポックまでのステップ数: 380
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 150
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 19
num epochs / epoch数: 20
batch size per device / バッチサイズ: 4
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 380
steps: 0%| | 0/380 [00:00<?, ?it/s]
epoch 1/20
rank:1, local_rank:1, world_size:2
rank:0, local_rank:0, world_size:2
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 125, in _main
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 236, in prepare
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\lora-scripts\sd-scripts\train_network.py", line 56, in <module>
torch.distributed.init_process_group(backend="gloo", world_size=world_size, rank=rank)
File "D:\lora-scripts\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 932, in init_process_group
_store_based_barrier(rank, store, timeout)
File "D:\lora-scripts\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 469, in _store_based_barrier
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)


ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 3221225786) local_rank: 0 (pid: 70072) of binary: D:\lora-scripts\venv\Scripts\python.exe
Traceback (most recent call last):
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\vipuser\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\lora-scripts\venv\lib\site-packages\accelerate\commands\launch.py", line 1027, in <module>
File "D:\lora-scripts\venv\lib\site-packages\accelerate\commands\launch.py", line 1023, in main
File "D:\lora-scripts\venv\lib\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
File "D:\lora-scripts\venv\lib\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
File "D:\lora-scripts\venv\lib\site-packages\torch\distributed\run.py", line 785, in run
File "D:\lora-scripts\venv\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "D:\lora-scripts\venv\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
raise ChildFailedError(

./sd-scripts/train_network.py FAILED

time : 2024-03-12_19:24:25
rank : 1 (local_rank: 1)
exitcode : 3221225786 (pid: 39764)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
time : 2024-03-12_19:24:25
rank : 0 (local_rank: 0)
exitcode : 3221225786 (pid: 70072)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

19:24:26-883399 ERROR Training failed / 训练失败
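
The exit code 3221225786 reported for both ranks is easier to read as a Windows NTSTATUS value in hexadecimal. A small sketch of the conversion follows; the function name and the two-entry lookup table are illustrative only and far from a complete list of status codes.

```python
# A few well-known NTSTATUS values (illustrative subset, not exhaustive).
KNOWN_NTSTATUS = {
    0xC000013A: "STATUS_CONTROL_C_EXIT",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_ntstatus_hex(exitcode: int) -> str:
    """Format a process exit code as an unsigned 32-bit hex value."""
    return f"0x{exitcode & 0xFFFFFFFF:08X}"


def describe_exitcode(exitcode: int) -> str:
    """Return the hex form plus a name if the code is in the small table above."""
    name = KNOWN_NTSTATUS.get(exitcode & 0xFFFFFFFF, "unknown")
    return f"{to_ntstatus_hex(exitcode)} ({name})"


if __name__ == "__main__":
    print(describe_exitcode(3221225786))
```

The value corresponds to 0xC000013A, STATUS_CONTROL_C_EXIT, meaning the worker processes were terminated by a Ctrl+C style interruption rather than crashing on their own. That would be consistent with the launcher tearing the workers down after the process group timed out, though the log alone does not confirm this.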