Full finetune memory issue on A100 80GB with DeepSpeed
Closed this issue · 2 comments
okdalto commented
Hi
I encountered a CUDA out of memory (OOM) issue while training a model on an NVIDIA A100 80GB GPU using DeepSpeed ZeRO Stage 2. Despite following the configuration provided in Issue #852, the OOM issue persists.
I used the following settings.
config.env:
DATALOADER_CONFIG='config/multidatabackend.json'
ASPECT_BUCKET_ROUNDING='2'
TRAINING_SEED='42'
USE_EMA='false'
USE_XFORMERS='false'
MINIMUM_RESOLUTION='0'
OUTPUT_DIR='output'
USE_DORA='false'
USE_BITFIT='false'
PUSH_TO_HUB='false'
PUSH_CHECKPOINTS='false'
NUM_EPOCHS='100'
MAX_NUM_STEPS='0'
CHECKPOINTING_STEPS='2000'
CHECKPOINTING_LIMIT='10'
HUB_MODEL_NAME='simpletuner-full'
TRACKER_PROJECT_NAME='simpletuner-flux-dev'
TRACKER_RUN_NAME='flux-dev'
DEBUG_EXTRA_ARGS='--report_to=wandb'
MODEL_TYPE='full'
MODEL_NAME='black-forest-labs/FLUX.1-dev'
FLUX='true'
PIXART_SIGMA='false'
KOLORS='false'
STABLE_DIFFUSION_3='false'
STABLE_DIFFUSION_LEGACY='false'
TRAIN_BATCH_SIZE='1'
USE_GRADIENT_CHECKPOINTING='true'
GRADIENT_ACCUMULATION_STEPS='2'
CAPTION_DROPOUT_PROBABILITY='0.1'
RESOLUTION_TYPE='pixel_area'
RESOLUTION='1024'
VALIDATION_SEED='0'
VALIDATION_STEPS='2000'
VALIDATION_RESOLUTION='1024x1024'
VALIDATION_GUIDANCE='3.0'
VALIDATION_GUIDANCE_RESCALE='0.0'
VALIDATION_NUM_INFERENCE_STEPS='25'
VALIDATION_PROMPT='A photo-realistic image of a cat'
ALLOW_TF32='true'
MIXED_PRECISION='bf16'
OPTIMIZER='adamw_bf16'
LEARNING_RATE='1e-6'
LR_SCHEDULE='polynomial'
LR_WARMUP_STEPS='10'
TRAINING_NUM_PROCESSES='1'
TRAINING_NUM_MACHINES='1'
VALIDATION_TORCH_COMPILE='true'
TRAINER_DYNAMO_BACKEND='inductor'
TRAINER_EXTRA_ARGS='--lr_end=1e-8 --compress_disk_cache'
MODEL_FAMILY='flux'
multidatabackend.json:
[
  {
    "id": "laion_ye_pop",
    "type": "local",
    "instance_data_dir": "/input/SimpleTuner/data",
    "crop": false,
    "crop_style": "center",
    "crop_aspect": "square",
    "minimum_image_size": 1024,
    "maximum_image_size": 1536,
    "target_downsample_size": 1024,
    "resolution": 1024,
    "resolution_type": "pixel_area",
    "caption_strategy": "textfile",
    "cache_dir_vae": "cache//vae-1024",
    "text_embeds": "laion-ye-pop-embed-cache",
    "ignore_epochs": false,
    "disabled": false,
    "skip_file_discovery": "",
    "metadata_backend": "json"
  },
  {
    "id": "laion-ye-pop-embed-cache",
    "dataset_type": "text_embeds",
    "default": true,
    "type": "local",
    "cache_dir": "cache//text",
    "disabled": false,
    "write_batch_size": 128
  }
]
deepspeed_config.yaml:
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
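For reference, the ZeRO settings this yaml enables can also be expressed through accelerate's DeepSpeedPlugin; the sketch below is only an illustration of the effective configuration (it is not how SimpleTuner constructs its Accelerator), to make explicit that ZeRO stage 2 runs here with a single process and no offload:

# Illustrative only: a rough programmatic equivalent of the accelerate yaml above.
# SimpleTuner reads the yaml via `accelerate launch`; this is not its actual code.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=2,                     # shard optimizer state + gradients across ranks
    gradient_accumulation_steps=2,
    gradient_clipping=1.0,
    offload_optimizer_device="none",  # optimizer state stays on the GPU
    offload_param_device="none",
    zero3_init_flag=False,
)

# With num_processes=1 there is only one rank to shard across, so ZeRO-2
# keeps the entire optimizer state on the single A100.
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)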
I got this error:
[2024-12-16 23:58:27,174] [INFO] [logging.py:129:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-12-16 23:58:27,174] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
[2024-12-16 23:58:27,174] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2024-12-16 23:58:27,174] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-12-16 23:58:27,174] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
CUDA out of memory. Tried to allocate 44.34 GiB. GPU 0 has a total capacity of 79.15 GiB of which 33.70 GiB is free. Process 3471184 has 45.45 GiB memory in use. Of the allocated memory 44.35 GiB is allocated by PyTorch, and 31.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
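If it helps, the 44.34 GiB allocation that fails is roughly the size of a float32 copy of the transformer weights, which is what ZeRO stage 1/2 builds for its optimizer partition. A back-of-the-envelope check (assuming FLUX.1-dev's transformer is on the order of 12B parameters; the exact count may differ):

# Rough memory estimate; the parameter count below is an assumption.
params = 12e9        # approximate FLUX.1-dev transformer parameter count
world_size = 1       # num_processes=1, so there is nothing to shard across

bf16_weights_gib = params * 2 / 2**30               # bf16 copy already resident on the GPU
fp32_master_gib = params * 4 / world_size / 2**30   # fp32 master copy ZeRO-2 keeps per rank

print(f"bf16 weights:         {bf16_weights_gib:.1f} GiB")   # ~22.4 GiB
print(f"fp32 master (1 rank): {fp32_master_gib:.1f} GiB")    # ~44.7 GiB, close to the failed allocation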
Could you please help investigate why DeepSpeed fails to utilize the available memory effectively on an A100 80GB with these configurations?
bghira commented
please use config.json instead of config.env following the flux quickstart guide
okdalto commented
Thank you for the comment.
However, I also tried config.json:
{
  "--resume_from_checkpoint": "latest",
  "--data_backend_config": "config/multidatabackend.json",
  "--aspect_bucket_rounding": 2,
  "--seed": 42,
  "--minimum_image_size": 0,
  "--disable_benchmark": false,
  "--output_dir": "output/models",
  "--num_train_epochs": 100,
  "--max_train_steps": 0,
  "--checkpointing_steps": 500,
  "--checkpoints_total_limit": 20,
  "--attention_mechanism": "diffusers",
  "--report_to": "none",
  "--model_type": "full",
  "--pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
  "--model_family": "flux",
  "--train_batch_size": 1,
  "--gradient_checkpointing": "true",
  "--caption_dropout_probability": 0.0,
  "--resolution_type": "pixel_area",
  "--resolution": 1024,
  "--validation_seed": 42,
  "--validation_steps": 500,
  "--validation_resolution": "1024x1024",
  "--validation_guidance": 3.0,
  "--validation_guidance_rescale": "0.0",
  "--validation_num_inference_steps": "20",
  "--validation_prompt": "A photo-realistic image of a cat",
  "--mixed_precision": "bf16",
  "--optimizer": "adamw_bf16",
  "--learning_rate": "5e-5",
  "--lr_scheduler": "polynomial",
  "--lr_warmup_steps": "0",
  "--validation_torch_compile": "false"
}
and still got the same error:
[2024-12-17 03:02:12,382] [INFO] [logging.py:129:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-12-17 03:02:12,382] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
[2024-12-17 03:02:12,382] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2024-12-17 03:02:12,382] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-12-17 03:02:12,382] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
CUDA out of memory. Tried to allocate 44.34 GiB. GPU 0 has a total capacity of 79.15 GiB of which 33.69 GiB is free. Process 3704516 has 45.45 GiB memory in use. Of the allocated memory 44.35 GiB is allocated by PyTorch, and 37.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/train.py", line 49, in <module>
    trainer.resume_and_prepare()
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/helpers/training/trainer.py", line 1641, in resume_and_prepare
    self.init_prepare_models(lr_scheduler=lr_scheduler)
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/helpers/training/trainer.py", line 1326, in init_prepare_models
    results = self.accelerator.prepare(
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 1318, in prepare
    result = self._prepare_deepspeed(*args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 1815, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/deepspeed/__init__.py", line 193, in initialize
    engine = DeepSpeedEngine(args=args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 313, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1302, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1560, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 395, in __init__
    self.device).clone().float().detach()
    ^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.34 GiB. GPU 0 has a total capacity of 79.15 GiB of which 33.69 GiB is free. Process 3704516 has 45.45 GiB memory in use. Of the allocated memory 44.35 GiB is allocated by PyTorch, and 37.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
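For what it's worth, the frame that raises (stage_1_and_2.py line 395) is where ZeRO stage 1/2 materializes the float32 master copy of this rank's flattened bf16 parameter partition. Conceptually (a simplified sketch, not DeepSpeed's actual code) that step amounts to:

import torch

# Simplified sketch of the allocation that OOMs (not DeepSpeed's actual code):
# ZeRO-1/2 flattens the bf16 parameter group and keeps an fp32 master copy of
# this rank's partition for the optimizer. With a single process, the
# "partition" is the whole model, so the fp32 copy alone is ~44 GiB.
def make_fp32_master_copy(flat_bf16_partition: torch.Tensor) -> torch.Tensor:
    # the .float() upcast is the ~44.34 GiB allocation seen in the traceback
    return flat_bf16_partition.clone().float().detach()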