bghira/SimpleTuner

Full finetune memory issue on A100 80GB with DeepSpeed

Closed this issue · 2 comments

Hi

I encountered a CUDA out of memory (OOM) issue while training a model on an NVIDIA A100 80GB GPU using DeepSpeed ZeRO Stage 2. Despite following the configuration provided in Issue #852, the OOM issue persists.

I used the following settings:

config.env:

DATALOADER_CONFIG='config/multidatabackend.json'
ASPECT_BUCKET_ROUNDING='2'
TRAINING_SEED='42'
USE_EMA='false'
USE_XFORMERS='false'
MINIMUM_RESOLUTION='0'
OUTPUT_DIR='output'
USE_DORA='false'
USE_BITFIT='false'
PUSH_TO_HUB='false'
PUSH_CHECKPOINTS='false'
NUM_EPOCHS='100'
MAX_NUM_STEPS='0'
CHECKPOINTING_STEPS='2000'
CHECKPOINTING_LIMIT='10'
HUB_MODEL_NAME='simpletuner-full'
TRACKER_PROJECT_NAME='simpletuner-flux-dev'
TRACKER_RUN_NAME='flux-dev'
DEBUG_EXTRA_ARGS='--report_to=wandb'
MODEL_TYPE='full'
MODEL_NAME='black-forest-labs/FLUX.1-dev'
FLUX='true'
PIXART_SIGMA='false'
KOLORS='false'
STABLE_DIFFUSION_3='false'
STABLE_DIFFUSION_LEGACY='false'
TRAIN_BATCH_SIZE='1'
USE_GRADIENT_CHECKPOINTING='true'
GRADIENT_ACCUMULATION_STEPS='2'
CAPTION_DROPOUT_PROBABILITY='0.1'
RESOLUTION_TYPE='pixel_area'
RESOLUTION='1024'
VALIDATION_SEED='0'
VALIDATION_STEPS='2000'
VALIDATION_RESOLUTION='1024x1024'
VALIDATION_GUIDANCE='3.0'
VALIDATION_GUIDANCE_RESCALE='0.0'
VALIDATION_NUM_INFERENCE_STEPS='25'
VALIDATION_PROMPT='A photo-realistic image of a cat'
ALLOW_TF32='true'
MIXED_PRECISION='bf16'
OPTIMIZER='adamw_bf16'
LEARNING_RATE='1e-6'
LR_SCHEDULE='polynomial'
LR_WARMUP_STEPS='10'
TRAINING_NUM_PROCESSES='1'
TRAINING_NUM_MACHINES='1'
VALIDATION_TORCH_COMPILE='true'
TRAINER_DYNAMO_BACKEND='inductor'
TRAINER_EXTRA_ARGS='--lr_end=1e-8 --compress_disk_cache'
MODEL_FAMILY='flux'

multidatabackend.json:

[
    {
        "id": "laion_ye_pop",
        "type": "local",
        "instance_data_dir": "/input/SimpleTuner/data",
        "crop": false,
        "crop_style": "center",
        "crop_aspect": "square",
        "minimum_image_size": 1024,
        "maximum_image_size": 1536,
        "target_downsample_size": 1024,
        "resolution": 1024,
        "resolution_type": "pixel_area",
        "caption_strategy": "textfile",
        "cache_dir_vae": "cache//vae-1024",
        "text_embeds": "laion-ye-pop-embed-cache",
        "ignore_epochs": false,
        "disabled": false,
        "skip_file_discovery": "",
        "metadata_backend": "json"
    },
    {
        "id": "laion-ye-pop-embed-cache",
        "dataset_type": "text_embeds",
        "default": true,
        "type": "local",
        "cache_dir": "cache//text",
        "disabled": false,
        "write_batch_size": 128
    }
]

deepspeed_config.yaml:

compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

I got this error:

[2024-12-16 23:58:27,174] [INFO] [logging.py:129:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-12-16 23:58:27,174] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
[2024-12-16 23:58:27,174] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2024-12-16 23:58:27,174] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-12-16 23:58:27,174] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
CUDA out of memory. Tried to allocate 44.34 GiB. GPU 0 has a total capacity of 79.15 GiB of which 33.70 GiB is free. Process 3471184 has 45.45 GiB memory in use. Of the allocated memory 44.35 GiB is allocated by PyTorch, and 31.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Could you please help investigate why DeepSpeed fails to utilize the available memory effectively on A100 80GB with these configurations?
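One way to read the error message above is to compare the requested allocation with the memory reported as free. The sketch below is only an illustration: `parse_oom` is a hypothetical helper, not part of SimpleTuner, DeepSpeed, or PyTorch, and the message string is shortened from the log.

```python
import re

# Shortened copy of the OOM message from the log above.
message = (
    "CUDA out of memory. Tried to allocate 44.34 GiB. "
    "GPU 0 has a total capacity of 79.15 GiB of which 33.70 GiB is free."
)

def parse_oom(text):
    """Hypothetical helper: pull the GiB figures out of a PyTorch OOM message."""
    pattern = (
        r"Tried to allocate (?P<requested>[\d.]+) GiB.*?"
        r"total capacity of (?P<total>[\d.]+) GiB "
        r"of which (?P<free>[\d.]+) GiB is free"
    )
    match = re.search(pattern, text)
    if match is None:
        raise ValueError("message does not look like a CUDA OOM report")
    return {name: float(value) for name, value in match.groupdict().items()}

info = parse_oom(message)
# A single allocation larger than the free memory cannot succeed,
# no matter how much of the card is nominally unused.
shortfall_gib = info["requested"] - info["free"]
print(f"requested {info['requested']} GiB, free {info['free']} GiB, "
      f"short by {shortfall_gib:.2f} GiB")
```

Read this way, the GPU is not being under-used: the one buffer DeepSpeed asks for is larger than what is still free after the first 45.45 GiB are in use.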

Please use config.json instead of config.env, following the Flux quickstart guide.

Thank you for the comment.

However, I also tried config.json:

{
    "--resume_from_checkpoint": "latest",
    "--data_backend_config": "config/multidatabackend.json",
    "--aspect_bucket_rounding": 2,
    "--seed": 42,
    "--minimum_image_size": 0,
    "--disable_benchmark": false,
    "--output_dir": "output/models",
    "--num_train_epochs": 100,
    "--max_train_steps": 0,
    "--checkpointing_steps": 500,
    "--checkpoints_total_limit": 20,
    "--attention_mechanism": "diffusers",
    "--report_to": "none",
    "--model_type": "full",
    "--pretrained_model_name_or_path": "black-forest-labs/FLUX.1-dev",
    "--model_family": "flux",
    "--train_batch_size": 1,
    "--gradient_checkpointing": "true",
    "--caption_dropout_probability": 0.0,
    "--resolution_type": "pixel_area",
    "--resolution": 1024,
    "--validation_seed": 42,
    "--validation_steps": 500,
    "--validation_resolution": "1024x1024",
    "--validation_guidance": 3.0,
    "--validation_guidance_rescale": "0.0",
    "--validation_num_inference_steps": "20",
    "--validation_prompt": "A photo-realistic image of a cat",
    "--mixed_precision": "bf16",
    "--optimizer": "adamw_bf16",
    "--learning_rate": "5e-5",
    "--lr_scheduler": "polynomial",
    "--lr_warmup_steps": "0",
    "--validation_torch_compile": "false"
}

and still got the same error:

[2024-12-17 03:02:12,382] [INFO] [logging.py:129:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-12-17 03:02:12,382] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
[2024-12-17 03:02:12,382] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2024-12-17 03:02:12,382] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-12-17 03:02:12,382] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
CUDA out of memory. Tried to allocate 44.34 GiB. GPU 0 has a total capacity of 79.15 GiB of which 33.69 GiB is free. Process 3704516 has 45.45 GiB memory in use. Of the allocated memory 44.35 GiB is allocated by PyTorch, and 37.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/train.py", line 49, in <module>
    trainer.resume_and_prepare()
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/helpers/training/trainer.py", line 1641, in resume_and_prepare
    self.init_prepare_models(lr_scheduler=lr_scheduler)
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/helpers/training/trainer.py", line 1326, in init_prepare_models
    results = self.accelerator.prepare(
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 1318, in prepare
    result = self._prepare_deepspeed(*args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/accelerate/accelerator.py", line 1815, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/deepspeed/__init__.py", line 193, in initialize
    engine = DeepSpeedEngine(args=args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 313, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1302, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1560, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/input/241205_gentle_monster/ST_test/SimpleTuner/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 395, in __init__
    self.device).clone().float().detach()
                         ^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.34 GiB. GPU 0 has a total capacity of 79.15 GiB of which 33.69 GiB is free. Process 3704516 has 45.45 GiB memory in use. Of the allocated memory 44.35 GiB is allocated by PyTorch, and 37.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
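The traceback ends in `.clone().float()`, so the failed 44.34 GiB allocation looks like an fp32 copy of the bf16 weights. A rough budget can be sketched from that one number. Everything below is back-of-the-envelope arithmetic under stated assumptions, not a measurement: the parameter count is inferred from the allocation size, and the optimizer is assumed to keep two fp32 moments per parameter as in a generic mixed-precision Adam setup, which may not match what `adamw_bf16` actually does under DeepSpeed.

```python
GIB = 1024 ** 3

# Figures copied from the error message above.
fp32_copy_gib = 44.34   # size of the failed allocation
gpu_total_gib = 79.15   # reported capacity of GPU 0

# An fp32 copy uses 4 bytes per parameter, so the allocation size
# implies the parameter count (roughly 11.9 billion).
n_params = fp32_copy_gib * GIB / 4

def to_gib(n_bytes):
    return n_bytes / GIB

budget_gib = {
    "bf16 weights": to_gib(n_params * 2),
    "fp32 master weights": to_gib(n_params * 4),
    # Assumption: two fp32 moment tensors per parameter (generic Adam).
    "optimizer moments (2 x fp32)": to_gib(n_params * 4 * 2),
    "bf16 gradients": to_gib(n_params * 2),
}

# ZeRO stage 2 shards optimizer state (including the fp32 master
# weights) and gradients across data-parallel ranks. With
# num_processes: 1 there is only one rank, so nothing is split.
world_size = 1
sharded = {"fp32 master weights", "optimizer moments (2 x fp32)", "bf16 gradients"}

per_gpu_gib = sum(
    size / world_size if name in sharded else size
    for name, size in budget_gib.items()
)

for name, size in budget_gib.items():
    print(f"{name:30s} {size:7.2f} GiB")
print(f"{'estimated per-GPU total':30s} {per_gpu_gib:7.2f} GiB "
      f"(capacity {gpu_total_gib} GiB)")
```

Under these assumptions the estimate is far above 80 GB before activations are counted, which would explain why the run fails during optimizer setup rather than during a training step. Raising `world_size` in the sketch, or moving the sharded entries off the GPU (the `offload_optimizer_device` setting in the YAML above is currently `none`), shows how the per-GPU figure would change.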