[BUG] CLI not using GPUs
Prerequisites
- I have read the documentation.
- I have checked other issues for similar problems.
Backend
Local
Interface Used
CLI
CLI Command
I have 2× A100 80GB GPUs and keep running out of memory when I try to fine-tune a 70B Llama using the CLI.
When I add flash attention, I instead get this: "You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`."
Note: I have used the GUI/web interface and it works well on other A100 40GB GPUs.
It seems to me that the CLI is not using the GPUs. In fact, the utilization report further down confirms that the CLI is not using the GPUs:
UI Screenshots & Parameters
# Load the Miniconda module
module load miniconda
# Activate the environment
source activate autot3
# Environment variables
export HF_USERNAME=cr7
export HF_TOKEN=somethings435uj
# Launch training from the config file
autotrain --config llama_train.yml
YML File
accelerate:
  multi_gpu: true
  num_processes: 2
  mixed_precision: bf16

task: llm-sft
base_model: meta-llama/Meta-Llama-3-70B-Instruct
project_name: autotrain-llama3-70b-generic-5
log: tensorboard
backend: local

data:
  path: /home/datasets
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text_column

params:
  block_size: 2048
  model_max_length: 8192
  epochs: 6
  batch_size: 1
  lr: 1e-5
  peft: true
  quantization: null
  target_modules: all-linear
  padding: right
  optimizer: paged_adamw_8bit
  scheduler: cosine
  gradient_accumulation: 8
  mixed_precision: bf16
  use_flash_attention_2: True

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
The same YAML file without flash attention:
accelerate:
  multi_gpu: true
  num_processes: 2

task: llm-sft
base_model: meta-llama/Meta-Llama-3-70B-Instruct
project_name: autotrain-llama3-70b-generic-5
log: tensorboard
backend: local

data:
  path: /home/datasets
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text_column

params:
  block_size: 2048
  model_max_length: 8192
  epochs: 6
  batch_size: 1
  lr: 1e-5
  peft: true
  quantization: null
  target_modules: all-linear
  padding: right
  optimizer: paged_adamw_8bit
  scheduler: cosine
  gradient_accumulation: 8
  mixed_precision: bf16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
Error Logs
With Flash Attention
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Without Flash Attention, I run out of memory with 2× A100 80GB, 4 CPUs, and 200GB of RAM:
INFO | 2024-09-24 08:27:08 | autotrain.parser:run:211 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-generic-5', 'data_path': '/home/datasets', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 2048, 'model_max_length': 8192, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 6, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text_column', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'AIlchemist', 'token': '', 'unsloth': False}
INFO | 2024-09-24 08:27:08 | autotrain.backends.local:create:8 - Starting local training...
INFO | 2024-09-24 08:27:08 | autotrain.commands:launch_command:489 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-generic-5/training_params.json']
INFO | 2024-09-24 08:27:08 | autotrain.commands:launch_command:490 - {'model': 'meta-llama/Meta-Llama-3-70B-Instruct', 'project_name': 'autotrain-llama3-70b-generic-5', 'data_path': '/home/datasets', 'train_split': 'train', 'valid_split': None, 'add_eos_token': True, 'block_size': 2048, 'model_max_length': 8192, 'padding': 'right', 'trainer': 'sft', 'use_flash_attention_2': False, 'log': 'tensorboard', 'disable_gradient_checkpointing': False, 'logging_steps': -1, 'eval_strategy': 'epoch', 'save_total_limit': 1, 'auto_find_batch_size': False, 'mixed_precision': 'bf16', 'lr': 1e-05, 'epochs': 6, 'batch_size': 1, 'warmup_ratio': 0.1, 'gradient_accumulation': 8, 'optimizer': 'paged_adamw_8bit', 'scheduler': 'cosine', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'chat_template': None, 'quantization': None, 'target_modules': 'all-linear', 'merge_adapter': False, 'peft': True, 'lora_r': 16, 'lora_alpha': 32, 'lora_dropout': 0.05, 'model_ref': None, 'dpo_beta': 0.1, 'max_prompt_length': 128, 'max_completion_length': None, 'prompt_text_column': None, 'text_column': 'text_column', 'rejected_text_column': None, 'push_to_hub': True, 'username': 'AIlchemist', 'token': '', 'unsloth': False}
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2024-09-24 08:27:20,515] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The default cache directory for DeepSpeed Triton autotune, /home/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.train_clm_sft:train:11 - Starting SFT training...
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:process_input_data:398 - Train data: Dataset({
features: ['text_column'],
num_rows: 8382
})
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:process_input_data:399 - Valid data: None
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:configure_logging_steps:471 - configuring logging steps
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:configure_logging_steps:484 - Logging steps: 25
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:configure_training_args:489 - configuring training args
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:configure_block_size:552 - Using block size 2048
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:get_model:587 - Can use unsloth: False
WARNING | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:get_model:629 - Unsloth not available, continuing without it...
INFO | 2024-09-24 08:27:22 | autotrain.trainers.clm.utils:get_model:631 - loading model config...
INFO | 2024-09-24 08:27:23 | autotrain.trainers.clm.utils:get_model:639 - loading model...
Loading checkpoint shards:  73%|███████▎  | 22/30 [00:54<00:19, 2.41s/it]
Traceback (most recent call last):
File "/home/.conda/envs/autot3/bin/accelerate", line 8, in
sys.exit(main())
File "/home/.conda/envs/autot3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/.conda/envs/autot3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
simple_launcher(args)
File "/home/.conda/envs/autot3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/.conda/envs/autot3/bin/python3.10', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-generic-5/training_params.json']' died with <S
ignals.SIGKILL: 9>.
INFO | 2024-09-24 08:28:20 | autotrain.parser:run:216 - Job ID: 392725
slurmstepd: error: Detected 1 oom_kill event in StepId=41974241.batch. Some of the step tasks have been OOM Killed.
A more detailed job description:
Overall Utilization
CPU utilization [|||||||||||||||||||||||||||||||||||||||||||||||99%]
CPU memory usage [||||||||||||||||||||||||||||||||||||||||||||||100%]
GPU utilization [ 0%]
GPU memory usage [ 1%]
Detailed Utilization
================================================================================
CPU utilization per node (CPU time used/run time)
r813u29n11: 23:40:30/1-00:00:04 (efficiency=98.6%)
CPU memory usage per node - used/allocated
r813u29n11: 199.6GB/200.0GB (49.9GB/50.0GB per core of 4)
GPU utilization per node
r813u29n11 (GPU 0): 0% <--- GPU was not used
r813u29n11 (GPU 1): 0% <--- GPU was not used
GPU memory usage per node - maximum used/total
r813u29n11 (GPU 0): 879.1MB/80.0GB (1.1%)
r813u29n11 (GPU 1): 1.3GB/80.0GB (1.6%)
Additional Information
No response
First of all, you don't need the accelerate part in the config YAML.
The error you are getting is both GPUs going out of memory.
You can find example configs here: https://github.com/huggingface/autotrain-advanced/tree/main/configs/llm_finetuning
The Llama 3 70B config, https://github.com/huggingface/autotrain-advanced/blob/main/configs/llm_finetuning/llama3-70b-sft.yml, was run on 8× H100. You might be able to run it on 2× A100 if you use int4 quantization and maybe lower the model max length.
Let me know if that doesn't work either. :)
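For reference, a minimal sketch of what the adjusted config could look like, based on the config posted above with those suggestions applied (accelerate block removed, int4 quantization enabled, model_max_length lowered). The 4096 value is only an illustrative guess, not something stated in this thread:

task: llm-sft
base_model: meta-llama/Meta-Llama-3-70B-Instruct
project_name: autotrain-llama3-70b-generic-5
log: tensorboard
backend: local

data:
  path: /home/datasets
  train_split: train
  valid_split: null
  chat_template: null
  column_mapping:
    text_column: text_column

params:
  block_size: 2048
  model_max_length: 4096   # lowered from 8192; illustrative value only
  epochs: 6
  batch_size: 1
  lr: 1e-5
  peft: true
  quantization: int4       # int4 quantization as suggested above
  target_modules: all-linear
  padding: right
  optimizer: paged_adamw_8bit
  scheduler: cosine
  gradient_accumulation: 8
  mixed_precision: bf16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true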
Thank you so much for all your work creating AutoTrain, and for your help. This works! One more question: at first it was running on two GPUs just fine, but then I got greedy and wanted to add more GPUs. Then it started running on only one GPU:
INFO | 2024-09-25 12:34:15 | autotrain.commands:launch_command:489 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'bf16', '-m', 'autotrain.trainers.clm', '--training_config', 'autotrain-llama3-70b-generic-13/training_params.json']
How do I force it to run on two or more GPUs? They are all A100 80GB.
If the autotrain command can see two GPUs, it will use both automatically. It seems like for some reason it doesn't see the second GPU. Can you run something like: CUDA_VISIBLE_DEVICES=0,1 autotrain ....... ?
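For concreteness, a minimal sketch of that invocation, assuming the same environment and config file name used earlier in this issue:

# Make both GPUs visible to autotrain (and to the accelerate launcher it spawns)
export CUDA_VISIBLE_DEVICES=0,1
autotrain --config llama_train.yml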
You are the best! Works like a charm! Thanks!