[BUG] CLI not using GPUs
Prerequisites
- I have read the documentation.
- I have checked other issues for similar problems.
Backend
Local
Interface Used
CLI
CLI Command
I have 2× A100 80GB GPUs and keep running out of memory when I try to fine-tune a 70B Llama using the CLI.
When I add flash attention, I instead get this: "You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`."
Note: I have used the GUI/web interface and it works well on other A100 40GB GPUs.
It seems to me that the CLI is not using the GPUs. In fact, the utilization report further down confirms that the CLI is not using the GPUs:
UI Screenshots & Parameters
# Load the Miniconda module
module load miniconda
# Activate the environment
source activate autot3
# Environment variables
export HF_USERNAME=cr7
export HF_TOKEN=somethings435uj
# Launch training from the config file
autotrain --config llama_train.yml
YML File
accelerate:
  multi_gpu: true
  num_processes: 2
  mixed_precision: bf16

task: llm-sft
base_model: meta-llama/Meta-Llama-3-70B-Instruct
project_name: autotrain-llama3-70b-generic-5
log: tensorboard
backend: local

data:
  path: /home/datasets
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text_column

params:
  block_size: 2048
  model_max_length: 8192
  epochs: 6
  batch_size: 1
  lr: 1e-5
  peft: true
  quantization: null
  target_modules: all-linear
  padding: right
  optimizer: paged_adamw_8bit
  scheduler: cosine
  gradient_accumulation: 8
  mixed_precision: bf16
  use_flash_attention_2: True

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
The same YAML file without flash attention:
accelerate:
  multi_gpu: true
  num_processes: 2

task: llm-sft
base_model: meta-llama/Meta-Llama-3-70B-Instruct
project_name: autotrain-llama3-70b-generic-5
log: tensorboard
backend: local

data:
  path: /home/datasets
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text_column

params:
  block_size: 2048
  model_max_length: 8192
  epochs: 6
  batch_size: 1
  lr: 1e-5
  peft: true
  quantization: null
  target_modules: all-linear
  padding: right
  optimizer: paged_adamw_8bit
  scheduler: cosine
  gradient_accumulation: 8
  mixed_precision: bf16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
Error Logs
With Flash Attention
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Without Flash Attention, I run out of memory with 2× A100 80GB, 4 CPUs, and 200GB of RAM:
INFO | 2024-09-24 08:27:08 | autotrain.parser:run:211 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-generic-5', 'data_path': '/home/datasets', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 2048, 'model_max_length': 8192, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 6, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text_column', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'AIlchemist', 'token': '', 'unsloth': False}
INFO | 2024-09-24 08:27:08 | autotrain.backends.local:create:8 - Starting local training...
INFO | 2024-09-24 08:27:08 | autotrain.commands:launch_command:489 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-generic-5/training_params.json']
INFO | 2024-09-24 08:27:08 | autotrain.commands:launch_command:490 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-generic-5', 'data_path': '/home/datasets', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 2048, 'model_max_length': 8192, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 6, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text_column', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'AIlchemist', 'token': '', 'unsloth': False}
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2024-09-24 08:27:20,515] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The default cache directory for DeepSpeed Triton autotune, /home/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.train_clm_sft:train:11 - Starting SFT training...
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:process_input_data:398 - Train data: Dataset({
features: ['text_column'],
num_rows: 8382
})
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:process_input_data:399 - Valid data: None
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:configure_logging_steps:471 - configuring logging steps
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:configure_logging_steps:484 - Logging steps: 25
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:configure_training_args:489 - configuring training args
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:configure_block_size:552 - Using block size 2048
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:get_model:587 - Can use unsloth: False
WARNING | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:get_model:629 - Unsloth not available, continuing without it...
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:get_model:631 - loading model config...
INFO | 2024-09-24 08:27:23 | autotrain.trainers.clm.utils:get_model:639 - loading model...
Loading checkpoint shards:  73%|███████▎  | 22/30 [00:54<00:19, 2.41s/it]
Traceback (most recent call last):
File "/home/.conda/envs/autot3/bin/accelerate", line 8, in
sys.exit(main())
File "/home/.conda/envs/autot3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/.conda/envs/autot3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
simple_launcher(args)
File "/home/.conda/envs/autot3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/.conda/envs/autot3/bin/python3.10', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-generic-5/training_params.json']' died with <S
ignals.SIGKILL: 9>.
INFO | 2024-09-24 08:28:20 | autotrain.parser:run:216 - Job ID: 392725
slurmstepd: error: Detected 1 oom_kill event in StepId=41974241.batch. Some of the step tasks have been OOM Killed.
A more detailed job description:
Overall Utilization
CPU utilization [|||||||||||||||||||||||||||||||||||||||||||||||99%]
CPU memory usage [||||||||||||||||||||||||||||||||||||||||||||||100%]
GPU utilization [ 0%]
GPU memory usage [ 1%]
Detailed Utilization
================================================================================
CPU utilization per node (CPU time used/run time)
r813u29n11: 23:40:30/1-00:00:04 (efficiency=98.6%)
CPU memory usage per node - used/allocated
r813u29n11: 199.6GB/200.0GB (49.9GB/50.0GB per core of 4)
GPU utilization per node
r813u29n11 (GPU 0): 0% <--- GPU was not used
r813u29n11 (GPU 1): 0% <--- GPU was not used
GPU memory usage per node - maximum used/total
r813u29n11 (GPU 0): 879.1MB/80.0GB (1.1%)
r813u29n11 (GPU 1): 1.3GB/80.0GB (1.6%)
Additional Information
No response
First of all, you don't need the accelerate part in the config YAML.
The error you are getting is both GPUs going out of memory.
You can find example configs here: https://github.com/huggingface/autotrain-advanced/tree/main/configs/llm_finetuning
The Llama 3 70B config, https://github.com/huggingface/autotrain-advanced/blob/main/configs/llm_finetuning/llama3-70b-sft.yml, was run on 8× H100. You might be able to run it on 2× A100 if you use int4 quantization and maybe lower the model max length.
Let me know if that doesn't work either. :)
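For reference, a minimal sketch of what the adjusted config could look like, based on the config posted above with those suggestions applied (accelerate block removed, int4 quantization enabled, model_max_length lowered). The 4096 value is only an illustrative guess, not something stated in this thread:

task: llm-sft
base_model: meta-llama/Meta-Llama-3-70B-Instruct
project_name: autotrain-llama3-70b-generic-5
log: tensorboard
backend: local

data:
  path: /home/datasets
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text_column

params:
  block_size: 2048
  model_max_length: 4096   # lowered from 8192; illustrative value only
  epochs: 6
  batch_size: 1
  lr: 1e-5
  peft: true
  quantization: int4       # int4 quantization as suggested above
  target_modules: all-linear
  padding: right
  optimizer: paged_adamw_8bit
  scheduler: cosine
  gradient_accumulation: 8
  mixed_precision: bf16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true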
Thank you so much for all your work creating AutoTrain, and for your help. This works! One more question: at first it was running on two GPUs just fine, but then I got greedy and wanted to add more GPUs. Then it started running on only one GPU:
INFO | 2024-09-25 12:34:15 | autotrain.commands:launch_command:489 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-generic-13/training_params.json']
How do I force it to run on two or more GPUs? They are all A100 80GB.
If the autotrain command can see two GPUs, it will use both automatically. It seems like for some reason it doesn't see the second GPU. Can you run something like: CUDA_VISIBLE_DEVICES=0,1 autotrain ....... ?
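For concreteness, a minimal sketch of that invocation, assuming the same environment and config file name used earlier in this issue:

# Make both GPUs visible to autotrain (and to the accelerate launcher it spawns)
export CUDA_VISIBLE_DEVICES=0,1
autotrain --config llama_train.yml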
You are the best! Works like a charm! Thanks!