ShivamShrirao/diffusers

Using the argument --train_text_encoder causes a crash with the error: ERROR:torch.distributed.elastic.multiprocessing.api:failed

a-l-e-x-d-s-9 opened this issue · 1 comment

Describe the bug

Training on an 8 GB VRAM card works fine, but when I add the argument "--train_text_encoder", the script crashes with the error "ERROR:torch.distributed.elastic.multiprocessing.api:failed". The full log is provided below.

Reproduction

Train with the argument "--train_text_encoder" on an 8 GB VRAM video card (full launch command is in the log below).

Logs

accelerate launch --mixed_precision="fp16" train_dreambooth.py \
  --pretrained_model_name_or_path="$MODEL_NAME"  \
  --instance_data_dir="$INSTANCE_DIR" \
  --class_data_dir="$CLASS_DIR" \
  --output_dir="$OUTPUT_DIR" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --train_text_encoder \
  --instance_prompt="zkz" \
  --class_prompt="artstyle" \
  --resolution=512 \
  --train_batch_size=1 \
  --sample_batch_size=1 \
  --gradient_accumulation_steps=16 --gradient_checkpointing \
  --learning_rate=5e-7 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=5500 \
  --save_min_steps 2000 \
  --save_interval 500
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:231: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
  warnings.warn(
[2023-02-17 09:34:41,016] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
/usr/local/lib/python3.10/dist-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Caching latents: 100%|██████████| 100/100 [00:23<00:00,  4.30it/s]
[2023-02-17 09:35:09,847] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.7, git-hash=unknown, git-branch=unknown
02/17/2023 09:35:09 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:2 to store for rank: 0
02/17/2023 09:35:09 - INFO - torch.distributed.distributed_c10d - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[2023-02-17 09:35:10,030] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-02-17 09:35:10,030] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-02-17 09:35:10,030] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-02-17 09:35:10,053] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-02-17 09:35:10,053] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-02-17 09:35:10,053] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2023-02-17 09:35:10,053] [INFO] [stage_1_and_2.py:140:__init__] Reduce bucket size 500,000,000
[2023-02-17 09:35:10,053] [INFO] [stage_1_and_2.py:141:__init__] Allgather bucket size 500,000,000
[2023-02-17 09:35:10,053] [INFO] [stage_1_and_2.py:142:__init__] CPU Offload: True
[2023-02-17 09:35:10,053] [INFO] [stage_1_and_2.py:143:__init__] Round robin gradient partitioning: False
Using /home/username/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /home/username/.cache/torch_extensions/py310_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.10493111610412598 seconds
Rank: 0 partition count [1] and sizes[(982581444, False)] 
[2023-02-17 09:35:13,148] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2023-02-17 09:35:13,148] [INFO] [utils.py:828:see_memory_usage] MA 3.87 GB         Max_MA 5.18 GB         CA 4.12 GB         Max_CA 5 GB 
[2023-02-17 09:35:13,148] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 9.52 GB, percent = 30.4%
[2023-02-17 09:35:23,685] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states
[2023-02-17 09:35:23,691] [INFO] [utils.py:828:see_memory_usage] MA 3.87 GB         Max_MA 3.87 GB         CA 4.12 GB         Max_CA 4 GB 
[2023-02-17 09:35:23,692] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 19.95 GB, percent = 63.8%
[2023-02-17 09:35:23,692] [INFO] [stage_1_and_2.py:525:__init__] optimizer state initialized
[2023-02-17 09:35:23,743] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2023-02-17 09:35:23,743] [INFO] [utils.py:828:see_memory_usage] MA 3.87 GB         Max_MA 3.87 GB         CA 4.12 GB         Max_CA 4 GB 
[2023-02-17 09:35:23,744] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 19.95 GB, percent = 63.8%
Traceback (most recent call last):
  File "/home/username/ShivamShriraoDiffusers/examples/dreambooth/train_dreambooth.py", line 869, in <module>
    main(args)
  File "/home/username/ShivamShriraoDiffusers/examples/dreambooth/train_dreambooth.py", line 684, in main
    unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 943, in prepare
    result = self._prepare_deepspeed(*args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1173, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 330, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1210, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1455, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 532, in __init__
    self._param_slice_mappings = self._create_param_mapping()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 544, in _create_param_mapping
    lp_name = self.param_names[lp]
KeyError: Parameter containing:
tensor([[[[-0.0306,  0.0856,  0.0961],
          [-0.0310, -0.2236,  0.1437],
          [-0.0229, -0.0176,  0.0335]],

         [[ 0.0575, -0.0858, -0.0020],
          [ 0.2028, -0.1363, -0.0154],
          [ 0.0056, -0.0035, -0.0050]],

         [[ 0.0284, -0.0120,  0.0429],
          [-0.0232,  0.0714,  0.0815],
          [-0.0006, -0.0079, -0.0421]],

         [[ 0.0213, -0.0481,  0.0459],
          [ 0.0588, -0.1469,  0.0871],
          [ 0.0160, -0.0206,  0.0467]]],


        [[[ 0.0040, -0.0283, -0.0498],
          [-0.0012,  0.0363,  0.1064],
          [-0.0128,  0.0153,  0.0205]],

         [[ 0.0036,  0.0103, -0.0124],
          [-0.0005,  0.0883,  0.0136],
          [ 0.0017,  0.0496,  0.0354]],

         [[ 0.0054,  0.0634,  0.0172],
          [ 0.0334, -0.2232,  0.0301],
          [ 0.0217, -0.0625,  0.0174]],

         [[-0.0254, -0.0479,  0.0177],
          [-0.0207,  0.0799,  0.0045],
          [-0.0218,  0.0142, -0.0304]]],


        [[[-0.0639, -0.0994, -0.0292],
          [ 0.0400,  0.0553,  0.0007],
          [ 0.0372,  0.0677,  0.0411]],

         [[-0.0172,  0.0785,  0.0417],
          [ 0.0257, -0.0919, -0.0552],
          [-0.0086,  0.0187, -0.0073]],

         [[-0.0062,  0.0763, -0.0165],
          [-0.0697,  0.2127, -0.0291],
          [-0.0024, -0.0885, -0.0085]],

         [[-0.0094, -0.0079, -0.0073],
          [-0.0123,  0.0632,  0.0640],
          [-0.0473,  0.0229, -0.0265]]],


        ...,


        [[[-0.0281,  0.0648, -0.0033],
          [ 0.0046,  0.0937, -0.0111],
          [ 0.0093, -0.0373,  0.0167]],

         [[ 0.0242,  0.0179,  0.0409],
          [ 0.0571,  0.0838, -0.0234],
          [-0.0122,  0.0471, -0.0046]],

         [[ 0.0317,  0.0052,  0.0059],
          [ 0.0648,  0.0198,  0.1066],
          [ 0.0247, -0.0276,  0.0881]],

         [[ 0.0699,  0.0481,  0.0511],
          [ 0.0502, -0.1165,  0.0168],
          [ 0.0134, -0.0013,  0.0407]]],


        [[[-0.0119,  0.0146,  0.0268],
          [ 0.0594, -0.0296,  0.0532],
          [ 0.0069,  0.0775, -0.0472]],

         [[-0.0390, -0.0125,  0.0228],
          [ 0.0482,  0.0641,  0.0502],
          [ 0.0292,  0.0352, -0.0029]],

         [[ 0.0824, -0.0152,  0.0623],
          [-0.0185,  0.0547, -0.0074],
          [-0.0610, -0.1374, -0.1008]],

         [[-0.0064,  0.0069, -0.0249],
          [ 0.0226, -0.0157,  0.0029],
          [ 0.0493,  0.0859,  0.0077]]],


        [[[-0.1221, -0.0880,  0.0061],
          [ 0.0974, -0.0250,  0.0491],
          [ 0.0509,  0.0339,  0.0598]],

         [[ 0.1167,  0.0466,  0.0113],
          [ 0.0593, -0.0068, -0.0049],
          [-0.0960,  0.0353, -0.0544]],

         [[-0.0608, -0.0522, -0.0227],
          [-0.0295,  0.0705, -0.0385],
          [ 0.0194,  0.0134, -0.0071]],

         [[-0.0370, -0.0483, -0.0071],
          [ 0.0369,  0.0301,  0.0346],
          [ 0.0528, -0.0056,  0.0150]]]], device='cuda:0', requires_grad=True)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 14351) of binary: /usr/bin/python3

System Info

  • diffusers version: 0.12.1
  • Platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Huggingface_hub version: 0.11.1
  • Transformers version: 0.16.0
  • Accelerate version: not installed
  • xFormers version: 0.0.16
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

DeepSpeed training isn't properly supported; it hasn't been tested with this script.
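
For context, here is a minimal, hypothetical sketch of the mismatch the traceback points at (not the actual DeepSpeed or train_dreambooth.py code, and the model stand-ins are placeholders). With --train_text_encoder the script builds a single optimizer over both the UNet's and the text encoder's parameters, while DeepSpeed's ZeRO stage 1/2 optimizer resolves every optimizer parameter through a name mapping built from the one module its engine wraps (roughly mirroring _create_param_mapping in stage_1_and_2.py). Any parameter belonging to the other model is missing from that mapping, which produces a KeyError like the one above; which of the two models ends up wrapped is an assumption here, the unet is used purely for illustration.

# Hypothetical stand-ins; not the real models or the real DeepSpeed internals.
import itertools

import torch
from torch import nn

unet = nn.Conv2d(4, 320, 3)        # stand-in for the UNet
text_encoder = nn.Linear(8, 8)     # stand-in for the CLIP text encoder

# The name mapping is built from the single module the engine wraps
# (assumed to be the unet in this sketch).
param_names = {p: n for n, p in unet.named_parameters()}

# With --train_text_encoder the script hands one optimizer over both models'
# parameters to accelerator.prepare().
optimizer = torch.optim.AdamW(
    itertools.chain(unet.parameters(), text_encoder.parameters()), lr=5e-7
)

# Resolving every optimizer parameter by name fails on the first parameter
# that does not belong to the wrapped module, analogous to the traceback.
try:
    for group in optimizer.param_groups:
        for lp in group["params"]:
            _ = param_names[lp]
except KeyError:
    print("KeyError: parameter not found in the wrapped module's name mapping")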