Using the argument --train_text_encoder causes a crash with the error: ERROR:torch.distributed.elastic.multiprocessing.api:failed
a-l-e-x-d-s-9 opened this issue · 1 comment
a-l-e-x-d-s-9 commented
Describe the bug
Training on an 8 GB VRAM card works fine, but when the argument "--train_text_encoder" is added, the script crashes with the error "ERROR:torch.distributed.elastic.multiprocessing.api:failed". The full log is provided below.
Reproduction
Run training with the argument "--train_text_encoder" on an 8 GB VRAM video card.
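For context, here is a minimal sketch of the code path the traceback below points at. The stand-in modules and values are illustrative assumptions, not the real UNet/CLIP text encoder or the script's actual argument handling: with --train_text_encoder the DreamBooth script optimizes the parameters of both models with a single optimizer and hands both models plus that optimizer to accelerator.prepare(), which is where the crash surfaces when Accelerate is configured to use the DeepSpeed backend.

import itertools
import torch
from accelerate import Accelerator

train_text_encoder = True

unet = torch.nn.Conv2d(4, 4, 3)       # stand-in for the UNet
text_encoder = torch.nn.Linear(8, 8)  # stand-in for the CLIP text encoder

# One optimizer over the parameters of both models when the flag is set.
params_to_optimize = (
    itertools.chain(unet.parameters(), text_encoder.parameters())
    if train_text_encoder
    else unet.parameters()
)
optimizer = torch.optim.AdamW(params_to_optimize, lr=5e-7)

# Under the DeepSpeed backend this prepare() call ends up in deepspeed.initialize(),
# which is where the KeyError in the log below is raised.
accelerator = Accelerator(gradient_accumulation_steps=16)
unet, text_encoder, optimizer = accelerator.prepare(unet, text_encoder, optimizer)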
Logs
accelerate launch --mixed_precision="fp16" train_dreambooth.py \
--pretrained_model_name_or_path="$MODEL_NAME" \
--instance_data_dir="$INSTANCE_DIR" \
--class_data_dir="$CLASS_DIR" \
--output_dir="$OUTPUT_DIR" \
--with_prior_preservation --prior_loss_weight=1.0 \
--train_text_encoder \
--instance_prompt="zkz" \
--class_prompt="artstyle" \
--resolution=512 \
--train_batch_size=1 \
--sample_batch_size=1 \
--gradient_accumulation_steps=16 --gradient_checkpointing \
--learning_rate=5e-7 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=5500 \
--save_min_steps 2000 \
--save_interval 500
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:231: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
warnings.warn(
[2023-02-17 09:34:41,016] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
/usr/local/lib/python3.10/dist-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Caching latents: 100%|██████████████████████████████████████████████████| 100/100 [00:23<00:00, 4.30it/s]
[2023-02-17 09:35:09,847] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.7, git-hash=unknown, git-branch=unknown
02/17/2023 09:35:09 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:2 to store for rank: 0
02/17/2023 09:35:09 - INFO - torch.distributed.distributed_c10d - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[2023-02-17 09:35:10,030] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-02-17 09:35:10,030] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-02-17 09:35:10,030] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-02-17 09:35:10,053] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2023-02-17 09:35:10,053] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-02-17 09:35:10,053] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2023-02-17 09:35:10,053] [INFO] [stage_1_and_2.py:140:__init__] Reduce bucket size 500,000,000
[2023-02-17 09:35:10,053] [INFO] [stage_1_and_2.py:141:__init__] Allgather bucket size 500,000,000
[2023-02-17 09:35:10,053] [INFO] [stage_1_and_2.py:142:__init__] CPU Offload: True
[2023-02-17 09:35:10,053] [INFO] [stage_1_and_2.py:143:__init__] Round robin gradient partitioning: False
Using /home/username/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /home/username/.cache/torch_extensions/py310_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.10493111610412598 seconds
Rank: 0 partition count [1] and sizes[(982581444, False)]
[2023-02-17 09:35:13,148] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2023-02-17 09:35:13,148] [INFO] [utils.py:828:see_memory_usage] MA 3.87 GB Max_MA 5.18 GB CA 4.12 GB Max_CA 5 GB
[2023-02-17 09:35:13,148] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 9.52 GB, percent = 30.4%
[2023-02-17 09:35:23,685] [INFO] [utils.py:827:see_memory_usage] After initializing optimizer states
[2023-02-17 09:35:23,691] [INFO] [utils.py:828:see_memory_usage] MA 3.87 GB Max_MA 3.87 GB CA 4.12 GB Max_CA 4 GB
[2023-02-17 09:35:23,692] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 19.95 GB, percent = 63.8%
[2023-02-17 09:35:23,692] [INFO] [stage_1_and_2.py:525:__init__] optimizer state initialized
[2023-02-17 09:35:23,743] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2023-02-17 09:35:23,743] [INFO] [utils.py:828:see_memory_usage] MA 3.87 GB Max_MA 3.87 GB CA 4.12 GB Max_CA 4 GB
[2023-02-17 09:35:23,744] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 19.95 GB, percent = 63.8%
Traceback (most recent call last):
File "/home/username/ShivamShriraoDiffusers/examples/dreambooth/train_dreambooth.py", line 869, in <module>
main(args)
File "/home/username/ShivamShriraoDiffusers/examples/dreambooth/train_dreambooth.py", line 684, in main
unet, text_encoder, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 943, in prepare
result = self._prepare_deepspeed(*args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1173, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 330, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1210, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1455, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 532, in __init__
self._param_slice_mappings = self._create_param_mapping()
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 544, in _create_param_mapping
lp_name = self.param_names[lp]
KeyError: Parameter containing:
tensor([[[[-0.0306, 0.0856, 0.0961],
[-0.0310, -0.2236, 0.1437],
[-0.0229, -0.0176, 0.0335]],
[[ 0.0575, -0.0858, -0.0020],
[ 0.2028, -0.1363, -0.0154],
[ 0.0056, -0.0035, -0.0050]],
[[ 0.0284, -0.0120, 0.0429],
[-0.0232, 0.0714, 0.0815],
[-0.0006, -0.0079, -0.0421]],
[[ 0.0213, -0.0481, 0.0459],
[ 0.0588, -0.1469, 0.0871],
[ 0.0160, -0.0206, 0.0467]]],
[[[ 0.0040, -0.0283, -0.0498],
[-0.0012, 0.0363, 0.1064],
[-0.0128, 0.0153, 0.0205]],
[[ 0.0036, 0.0103, -0.0124],
[-0.0005, 0.0883, 0.0136],
[ 0.0017, 0.0496, 0.0354]],
[[ 0.0054, 0.0634, 0.0172],
[ 0.0334, -0.2232, 0.0301],
[ 0.0217, -0.0625, 0.0174]],
[[-0.0254, -0.0479, 0.0177],
[-0.0207, 0.0799, 0.0045],
[-0.0218, 0.0142, -0.0304]]],
[[[-0.0639, -0.0994, -0.0292],
[ 0.0400, 0.0553, 0.0007],
[ 0.0372, 0.0677, 0.0411]],
[[-0.0172, 0.0785, 0.0417],
[ 0.0257, -0.0919, -0.0552],
[-0.0086, 0.0187, -0.0073]],
[[-0.0062, 0.0763, -0.0165],
[-0.0697, 0.2127, -0.0291],
[-0.0024, -0.0885, -0.0085]],
[[-0.0094, -0.0079, -0.0073],
[-0.0123, 0.0632, 0.0640],
[-0.0473, 0.0229, -0.0265]]],
...,
[[[-0.0281, 0.0648, -0.0033],
[ 0.0046, 0.0937, -0.0111],
[ 0.0093, -0.0373, 0.0167]],
[[ 0.0242, 0.0179, 0.0409],
[ 0.0571, 0.0838, -0.0234],
[-0.0122, 0.0471, -0.0046]],
[[ 0.0317, 0.0052, 0.0059],
[ 0.0648, 0.0198, 0.1066],
[ 0.0247, -0.0276, 0.0881]],
[[ 0.0699, 0.0481, 0.0511],
[ 0.0502, -0.1165, 0.0168],
[ 0.0134, -0.0013, 0.0407]]],
[[[-0.0119, 0.0146, 0.0268],
[ 0.0594, -0.0296, 0.0532],
[ 0.0069, 0.0775, -0.0472]],
[[-0.0390, -0.0125, 0.0228],
[ 0.0482, 0.0641, 0.0502],
[ 0.0292, 0.0352, -0.0029]],
[[ 0.0824, -0.0152, 0.0623],
[-0.0185, 0.0547, -0.0074],
[-0.0610, -0.1374, -0.1008]],
[[-0.0064, 0.0069, -0.0249],
[ 0.0226, -0.0157, 0.0029],
[ 0.0493, 0.0859, 0.0077]]],
[[[-0.1221, -0.0880, 0.0061],
[ 0.0974, -0.0250, 0.0491],
[ 0.0509, 0.0339, 0.0598]],
[[ 0.1167, 0.0466, 0.0113],
[ 0.0593, -0.0068, -0.0049],
[-0.0960, 0.0353, -0.0544]],
[[-0.0608, -0.0522, -0.0227],
[-0.0295, 0.0705, -0.0385],
[ 0.0194, 0.0134, -0.0071]],
[[-0.0370, -0.0483, -0.0071],
[ 0.0369, 0.0301, 0.0346],
[ 0.0528, -0.0056, 0.0150]]]], device='cuda:0', requires_grad=True)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 14351) of binary: /usr/bin/python3
System Info
- `diffusers` version: 0.12.1
- Platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.35
- Python version: 3.10.6
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- Huggingface_hub version: 0.11.1
- Transformers version: 0.16.0
- Accelerate version: not installed
- xFormers version: 0.0.16
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
ShivamShrirao commented
DeepSpeed training isn't properly supported; --train_text_encoder hasn't been tested with it.
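One plausible reading of the traceback, with a self-contained illustration of the failure mode (this is an assumption about what goes wrong, not DeepSpeed's actual code): Accelerate's DeepSpeed integration wraps a single model, and DeepSpeed's ZeRO stage 1/2 optimizer builds its parameter-name lookup from that one model. With --train_text_encoder the client optimizer also contains the second model's parameters, so the lookup in _create_param_mapping hits a parameter it has never seen and raises the KeyError shown in the log.

import torch

model_a = torch.nn.Conv2d(4, 4, 3)  # stand-in for the model DeepSpeed wraps
model_b = torch.nn.Linear(8, 8)     # stand-in for the second, unwrapped model

# The joint optimizer covers both models, as the script does with --train_text_encoder.
optimizer = torch.optim.AdamW(
    list(model_a.parameters()) + list(model_b.parameters()), lr=5e-7
)

# The name mapping only knows about the wrapped model's parameters, mirroring how
# the ZeRO optimizer's param_names is populated from the single engine model.
param_names = {p: n for n, p in model_a.named_parameters()}

for group in optimizer.param_groups:
    for lp in group["params"]:
        name = param_names[lp]  # KeyError: Parameter containing: tensor(...)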