axolotl-ai-cloud/axolotl

OOM Training 70B on 4x3090 24GB with Both FSDP and DeepSpeed ZeRO-3

Nero10578 opened this issue · 15 comments

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I expect the model to be sharded across the GPUs so that a 70B model loaded in 4-bit can be trained with QLoRA on 4x24GB GPUs.
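(Back-of-envelope, using my own rough numbers: 70B parameters at 4-bit is roughly 35 GB of weights, so sharded across 4 GPUs that is under 10 GB per card, which should leave headroom on 24 GB cards for the LoRA adapter, optimizer state, and activations.)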

Current behaviour

Axolotl tries to load the full model on each GPU and OOMs. The log below shows FSDP; I have tried DeepSpeed ZeRO-3 before and it didn't work either.
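One generic way to confirm this (not axolotl-specific, just an illustrative check) is to watch per-GPU memory while the checkpoint shards load, for example:

watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

In the run below, each process ends up with ~23.6 GiB in use before the OOM.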


accelerate launch -m axolotl.cli.train qlora-sft.yml
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `4`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2024-06-26 17:05:49,962] [INFO] [datasets.<module>:58] [PID:4035] PyTorch version 2.3.0 available.
[2024-06-26 17:05:49,963] [INFO] [datasets.<module>:58] [PID:4037] PyTorch version 2.3.0 available.
[2024-06-26 17:05:49,983] [INFO] [datasets.<module>:58] [PID:4036] PyTorch version 2.3.0 available.
[2024-06-26 17:05:49,990] [INFO] [datasets.<module>:58] [PID:4034] PyTorch version 2.3.0 available.
[2024-06-26 17:05:50,727] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-26 17:05:50,729] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-26 17:05:50,741] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-26 17:05:50,748] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-26 17:05:50,795] [INFO] [root.spawn:38] [PID:4035] gcc -pthread -B /home/administrator/miniconda3/envs/axolotl/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/administrator/miniconda3/envs/axolotl/include -fPIC -O2 -isystem /home/administrator/miniconda3/envs/axolotl/include -fPIC -c /tmp/tmpmnmihfxj/test.c -o /tmp/tmpmnmihfxj/test.o
[2024-06-26 17:05:50,797] [INFO] [root.spawn:38] [PID:4037] gcc -pthread -B /home/administrator/miniconda3/envs/axolotl/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/administrator/miniconda3/envs/axolotl/include -fPIC -O2 -isystem /home/administrator/miniconda3/envs/axolotl/include -fPIC -c /tmp/tmpwgu9demf/test.c -o /tmp/tmpwgu9demf/test.o
[2024-06-26 17:05:50,808] [INFO] [root.spawn:38] [PID:4036] gcc -pthread -B /home/administrator/miniconda3/envs/axolotl/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/administrator/miniconda3/envs/axolotl/include -fPIC -O2 -isystem /home/administrator/miniconda3/envs/axolotl/include -fPIC -c /tmp/tmpacnc2e4d/test.c -o /tmp/tmpacnc2e4d/test.o
[2024-06-26 17:05:50,815] [INFO] [root.spawn:38] [PID:4035] gcc -pthread -B /home/administrator/miniconda3/envs/axolotl/compiler_compat /tmp/tmpmnmihfxj/test.o -laio -o /tmp/tmpmnmihfxj/a.out
[2024-06-26 17:05:50,815] [INFO] [root.spawn:38] [PID:4034] gcc -pthread -B /home/administrator/miniconda3/envs/axolotl/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/administrator/miniconda3/envs/axolotl/include -fPIC -O2 -isystem /home/administrator/miniconda3/envs/axolotl/include -fPIC -c /tmp/tmp2lmopr_s/test.c -o /tmp/tmp2lmopr_s/test.o
[2024-06-26 17:05:50,816] [INFO] [root.spawn:38] [PID:4037] gcc -pthread -B /home/administrator/miniconda3/envs/axolotl/compiler_compat /tmp/tmpwgu9demf/test.o -laio -o /tmp/tmpwgu9demf/a.out
[2024-06-26 17:05:50,826] [INFO] [root.spawn:38] [PID:4036] gcc -pthread -B /home/administrator/miniconda3/envs/axolotl/compiler_compat /tmp/tmpacnc2e4d/test.o -laio -o /tmp/tmpacnc2e4d/a.out
[2024-06-26 17:05:50,833] [INFO] [root.spawn:38] [PID:4034] gcc -pthread -B /home/administrator/miniconda3/envs/axolotl/compiler_compat /tmp/tmp2lmopr_s/test.o -laio -o /tmp/tmp2lmopr_s/a.out
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[2024-06-26 17:05:52,169] [DEBUG] [axolotl.normalize_config:80] [PID:4034] [RANK:0] bf16 support detected, enabling for this configuration.
[2024-06-26 17:05:52,171] [INFO] [axolotl.normalize_config:183] [PID:4034] [RANK:0] GPU memory usage baseline: 0.000GB (+0.350GB misc)
[2024-06-26 17:05:52,175] [DEBUG] [axolotl.normalize_config:80] [PID:4037] [RANK:3] bf16 support detected, enabling for this configuration.
[2024-06-26 17:05:52,177] [INFO] [axolotl.normalize_config:183] [PID:4037] [RANK:3] GPU memory usage baseline: 0.000GB (+0.331GB misc)
[2024-06-26 17:05:52,188] [DEBUG] [axolotl.normalize_config:80] [PID:4035] [RANK:1] bf16 support detected, enabling for this configuration.
[2024-06-26 17:05:52,209] [DEBUG] [axolotl.normalize_config:80] [PID:4036] [RANK:2] bf16 support detected, enabling for this configuration.
[2024-06-26 17:05:52,229] [INFO] [axolotl.normalize_config:183] [PID:4036] [RANK:2] GPU memory usage baseline: 0.000GB (+0.331GB misc)
[2024-06-26 17:05:52,229] [INFO] [axolotl.normalize_config:183] [PID:4035] [RANK:1] GPU memory usage baseline: 0.000GB (+0.331GB misc)
                                 dP            dP   dP 
                                 88            88   88 
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88 
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88 
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88 
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP 
                                                       
                                                       

****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.30.1         
        peft: 0.11.1         
transformers: 4.41.1         
         trl: 0.8.7.dev0     
       torch: 2.3.0          
bitsandbytes: 0.43.1         
****************************************
[2024-06-26 17:05:52,377] [WARNING] [axolotl.scripts.check_user_token:487] [PID:4034] [RANK:0] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2024-06-26 17:05:52,526] [WARNING] [axolotl.scripts.check_user_token:487] [PID:4036] [RANK:2] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2024-06-26 17:05:52,542] [WARNING] [axolotl.scripts.check_user_token:487] [PID:4035] [RANK:1] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2024-06-26 17:05:52,542] [WARNING] [axolotl.scripts.check_user_token:487] [PID:4037] [RANK:3] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-06-26 17:05:52,748] [DEBUG] [axolotl.load_tokenizer:280] [PID:4034] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-06-26 17:05:52,748] [DEBUG] [axolotl.load_tokenizer:281] [PID:4034] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-06-26 17:05:52,748] [DEBUG] [axolotl.load_tokenizer:282] [PID:4034] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-06-26 17:05:52,749] [DEBUG] [axolotl.load_tokenizer:283] [PID:4034] [RANK:0] UNK: None / None
[2024-06-26 17:05:52,749] [INFO] [axolotl.load_tokenizer:294] [PID:4034] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-06-26 17:05:52,749] [INFO] [axolotl.load_tokenized_prepared_datasets:179] [PID:4034] [RANK:0] Loading prepared dataset from disk at last_run_prepared/1062aa90746aa09fbd3b021d4933fc94...
[2024-06-26 17:05:52,752] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:4034] [RANK:0] Prepared dataset loaded from disk...
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-06-26 17:05:52,897] [DEBUG] [axolotl.load_tokenizer:280] [PID:4036] [RANK:2] EOS: 128009 / <|eot_id|>
[2024-06-26 17:05:52,897] [DEBUG] [axolotl.load_tokenizer:281] [PID:4036] [RANK:2] BOS: 128000 / <|begin_of_text|>
[2024-06-26 17:05:52,897] [DEBUG] [axolotl.load_tokenizer:282] [PID:4036] [RANK:2] PAD: 128001 / <|end_of_text|>
[2024-06-26 17:05:52,897] [DEBUG] [axolotl.load_tokenizer:283] [PID:4036] [RANK:2] UNK: None / None
[2024-06-26 17:05:52,897] [INFO] [axolotl.load_tokenizer:294] [PID:4036] [RANK:2] No Chat template selected. Consider adding a chat template for easier inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-06-26 17:05:52,935] [DEBUG] [axolotl.load_tokenizer:280] [PID:4035] [RANK:1] EOS: 128009 / <|eot_id|>
[2024-06-26 17:05:52,935] [DEBUG] [axolotl.load_tokenizer:281] [PID:4035] [RANK:1] BOS: 128000 / <|begin_of_text|>
[2024-06-26 17:05:52,935] [DEBUG] [axolotl.load_tokenizer:282] [PID:4035] [RANK:1] PAD: 128001 / <|end_of_text|>
[2024-06-26 17:05:52,935] [DEBUG] [axolotl.load_tokenizer:283] [PID:4035] [RANK:1] UNK: None / None
[2024-06-26 17:05:52,935] [INFO] [axolotl.load_tokenizer:294] [PID:4035] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-06-26 17:05:52,959] [DEBUG] [axolotl.load_tokenizer:280] [PID:4037] [RANK:3] EOS: 128009 / <|eot_id|>
[2024-06-26 17:05:52,959] [DEBUG] [axolotl.load_tokenizer:281] [PID:4037] [RANK:3] BOS: 128000 / <|begin_of_text|>
[2024-06-26 17:05:52,959] [DEBUG] [axolotl.load_tokenizer:282] [PID:4037] [RANK:3] PAD: 128001 / <|end_of_text|>
[2024-06-26 17:05:52,959] [DEBUG] [axolotl.load_tokenizer:283] [PID:4037] [RANK:3] UNK: None / None
[2024-06-26 17:05:52,959] [INFO] [axolotl.load_tokenizer:294] [PID:4037] [RANK:3] No Chat template selected. Consider adding a chat template for easier inference.
[2024-06-26 17:05:53,209] [INFO] [axolotl.load_tokenized_prepared_datasets:179] [PID:4035] [RANK:1] Loading prepared dataset from disk at last_run_prepared/1062aa90746aa09fbd3b021d4933fc94...
[2024-06-26 17:05:53,209] [INFO] [axolotl.load_tokenized_prepared_datasets:179] [PID:4036] [RANK:2] Loading prepared dataset from disk at last_run_prepared/1062aa90746aa09fbd3b021d4933fc94...
[2024-06-26 17:05:53,209] [INFO] [axolotl.load_tokenized_prepared_datasets:179] [PID:4037] [RANK:3] Loading prepared dataset from disk at last_run_prepared/1062aa90746aa09fbd3b021d4933fc94...
[2024-06-26 17:05:53,211] [DEBUG] [axolotl.calculate_total_num_steps:299] [PID:4034] [RANK:0] total_num_tokens: 16_871
[2024-06-26 17:05:53,212] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:4035] [RANK:1] Prepared dataset loaded from disk...
[2024-06-26 17:05:53,212] [DEBUG] [axolotl.calculate_total_num_steps:312] [PID:4034] [RANK:0] `total_supervised_tokens: 11_272`
[2024-06-26 17:05:53,213] [DEBUG] [axolotl.calculate_total_num_steps:390] [PID:4034] [RANK:0] total_num_steps: 5
[2024-06-26 17:05:53,213] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:4037] [RANK:3] Prepared dataset loaded from disk...
[2024-06-26 17:05:53,213] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:4036] [RANK:2] Prepared dataset loaded from disk...
[2024-06-26 17:05:53,244] [DEBUG] [axolotl.train.train:56] [PID:4034] [RANK:0] loading tokenizer... /home/administrator/models/Meta-Llama-3-70B-Instruct-abliterated-v3.5
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-06-26 17:05:53,593] [DEBUG] [axolotl.load_tokenizer:280] [PID:4035] [RANK:1] EOS: 128009 / <|eot_id|>
[2024-06-26 17:05:53,593] [DEBUG] [axolotl.load_tokenizer:281] [PID:4035] [RANK:1] BOS: 128000 / <|begin_of_text|>
[2024-06-26 17:05:53,593] [DEBUG] [axolotl.load_tokenizer:282] [PID:4035] [RANK:1] PAD: 128001 / <|end_of_text|>
[2024-06-26 17:05:53,593] [DEBUG] [axolotl.load_tokenizer:283] [PID:4035] [RANK:1] UNK: None / None
[2024-06-26 17:05:53,593] [INFO] [axolotl.load_tokenizer:294] [PID:4035] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-06-26 17:05:53,602] [DEBUG] [axolotl.load_tokenizer:280] [PID:4034] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-06-26 17:05:53,602] [DEBUG] [axolotl.load_tokenizer:281] [PID:4034] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-06-26 17:05:53,602] [DEBUG] [axolotl.load_tokenizer:282] [PID:4034] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-06-26 17:05:53,602] [DEBUG] [axolotl.load_tokenizer:283] [PID:4034] [RANK:0] UNK: None / None
[2024-06-26 17:05:53,603] [INFO] [axolotl.load_tokenizer:294] [PID:4034] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-06-26 17:05:53,603] [DEBUG] [axolotl.train.train:85] [PID:4034] [RANK:0] loading model and peft_config...
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-06-26 17:05:53,607] [DEBUG] [axolotl.load_tokenizer:280] [PID:4037] [RANK:3] EOS: 128009 / <|eot_id|>
[2024-06-26 17:05:53,607] [DEBUG] [axolotl.load_tokenizer:281] [PID:4037] [RANK:3] BOS: 128000 / <|begin_of_text|>
[2024-06-26 17:05:53,607] [DEBUG] [axolotl.load_tokenizer:282] [PID:4037] [RANK:3] PAD: 128001 / <|end_of_text|>
[2024-06-26 17:05:53,607] [DEBUG] [axolotl.load_tokenizer:283] [PID:4037] [RANK:3] UNK: None / None
[2024-06-26 17:05:53,607] [INFO] [axolotl.load_tokenizer:294] [PID:4037] [RANK:3] No Chat template selected. Consider adding a chat template for easier inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-06-26 17:05:53,661] [DEBUG] [axolotl.load_tokenizer:280] [PID:4036] [RANK:2] EOS: 128009 / <|eot_id|>
[2024-06-26 17:05:53,661] [DEBUG] [axolotl.load_tokenizer:281] [PID:4036] [RANK:2] BOS: 128000 / <|begin_of_text|>
[2024-06-26 17:05:53,661] [DEBUG] [axolotl.load_tokenizer:282] [PID:4036] [RANK:2] PAD: 128001 / <|end_of_text|>
[2024-06-26 17:05:53,661] [DEBUG] [axolotl.load_tokenizer:283] [PID:4036] [RANK:2] UNK: None / None
[2024-06-26 17:05:53,661] [INFO] [axolotl.load_tokenizer:294] [PID:4036] [RANK:2] No Chat template selected. Consider adding a chat template for easier inference.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████| 30/30 [00:57<00:00,  1.93s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████| 30/30 [00:57<00:00,  1.93s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████| 30/30 [00:57<00:00,  1.92s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████| 30/30 [00:57<00:00,  1.92s/it]
[2024-06-26 17:06:52,692] [INFO] [axolotl.load_model:734] [PID:4036] [RANK:2] GPU memory usage after model load: 4.929GB (+0.216GB cache, +0.730GB misc)
[2024-06-26 17:06:52,692] [INFO] [axolotl.load_model:734] [PID:4034] [RANK:0] GPU memory usage after model load: 4.929GB (+0.216GB cache, +0.750GB misc)
[2024-06-26 17:06:52,693] [INFO] [axolotl.load_model:734] [PID:4035] [RANK:1] GPU memory usage after model load: 4.929GB (+0.216GB cache, +0.730GB misc)
[2024-06-26 17:06:52,694] [INFO] [axolotl.load_model:734] [PID:4037] [RANK:3] GPU memory usage after model load: 4.929GB (+0.216GB cache, +0.730GB misc)
[2024-06-26 17:06:52,697] [INFO] [axolotl.load_lora:951] [PID:4036] [RANK:2] found linear modules: ['k_proj', 'gate_proj', 'down_proj', 'up_proj', 'o_proj', 'q_proj', 'v_proj']
[2024-06-26 17:06:52,697] [INFO] [axolotl.load_lora:951] [PID:4034] [RANK:0] found linear modules: ['k_proj', 'up_proj', 'gate_proj', 'q_proj', 'o_proj', 'v_proj', 'down_proj']
[2024-06-26 17:06:52,698] [INFO] [axolotl.load_lora:951] [PID:4035] [RANK:1] found linear modules: ['o_proj', 'v_proj', 'k_proj', 'up_proj', 'down_proj', 'q_proj', 'gate_proj']
[2024-06-26 17:06:52,698] [INFO] [axolotl.load_lora:951] [PID:4037] [RANK:3] found linear modules: ['v_proj', 'up_proj', 'gate_proj', 'down_proj', 'k_proj', 'q_proj', 'o_proj']
trainable params: 103,546,880 || all params: 70,657,253,376 || trainable%: 0.1465
[rank2]: Traceback (most recent call last):
[rank2]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank2]:   File "<frozen runpy>", line 88, in _run_code
[rank2]:   File "/home/administrator/axolotl/src/axolotl/cli/train.py", line 70, in <module>
[rank2]:     fire.Fire(do_cli)
[rank2]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank2]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank2]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank2]:     component, remaining_args = _CallAndUpdateTrace(
[rank2]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank2]:     component = fn(*varargs, **kwargs)
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/administrator/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
[rank2]:     return do_train(parsed_cfg, parsed_cli_args)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/administrator/axolotl/src/axolotl/cli/train.py", line 66, in do_train
[rank2]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/administrator/axolotl/src/axolotl/train.py", line 88, in train
[rank2]:     model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
[rank2]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/administrator/axolotl/src/axolotl/utils/models.py", line 822, in load_model
[rank2]:     model.to(f"cuda:{cfg.local_rank}")
[rank2]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1173, in to
[rank2]:     return self._apply(convert)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
[rank2]:     module._apply(fn)
[rank2]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
[rank2]:     module._apply(fn)
[rank2]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
[rank2]:     module._apply(fn)
[rank2]:   [Previous line repeated 5 more times]
[rank2]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 804, in _apply
[rank2]:     param_applied = fn(param)
[rank2]:                     ^^^^^^^^^
[rank2]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1159, in convert
[rank2]:     return t.to(
[rank2]:            ^^^^^
[rank2]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 330, in to
[rank2]:     super().to(device=device, dtype=dtype, non_blocking=non_blocking),
[rank2]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU  has a total capacity of 23.68 GiB of which 60.31 MiB is free. Including non-PyTorch memory, this process has 23.61 GiB memory in use. Of the allocated memory 23.15 GiB is allocated by PyTorch, and 64.54 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank3]: Traceback (most recent call last):
[rank3]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank3]:   File "<frozen runpy>", line 88, in _run_code
[rank3]:   File "/home/administrator/axolotl/src/axolotl/cli/train.py", line 70, in <module>
[rank3]:     fire.Fire(do_cli)
[rank3]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank3]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank3]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank3]:     component, remaining_args = _CallAndUpdateTrace(
[rank3]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank3]:     component = fn(*varargs, **kwargs)
[rank3]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/administrator/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
[rank3]:     return do_train(parsed_cfg, parsed_cli_args)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/administrator/axolotl/src/axolotl/cli/train.py", line 66, in do_train
[rank3]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/administrator/axolotl/src/axolotl/train.py", line 88, in train
[rank3]:     model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
[rank3]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/administrator/axolotl/src/axolotl/utils/models.py", line 822, in load_model
[rank3]:     model.to(f"cuda:{cfg.local_rank}")
[rank3]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1173, in to
[rank3]:     return self._apply(convert)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
[rank3]:     module._apply(fn)
[rank3]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
[rank3]:     module._apply(fn)
[rank3]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
[rank3]:     module._apply(fn)
[rank3]:   [Previous line repeated 5 more times]
[rank3]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 804, in _apply
[rank3]:     param_applied = fn(param)
[rank3]:                     ^^^^^^^^^
[rank3]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1159, in convert
[rank3]:     return t.to(
[rank3]:            ^^^^^
[rank3]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 330, in to
[rank3]:     super().to(device=device, dtype=dtype, non_blocking=non_blocking),
[rank3]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU  has a total capacity of 23.68 GiB of which 60.31 MiB is free. Including non-PyTorch memory, this process has 23.61 GiB memory in use. Of the allocated memory 23.15 GiB is allocated by PyTorch, and 64.54 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank1]: Traceback (most recent call last):
[rank1]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank1]:   File "<frozen runpy>", line 88, in _run_code
[rank1]:   File "/home/administrator/axolotl/src/axolotl/cli/train.py", line 70, in <module>
[rank1]:     fire.Fire(do_cli)
[rank1]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/administrator/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
[rank1]:     return do_train(parsed_cfg, parsed_cli_args)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/administrator/axolotl/src/axolotl/cli/train.py", line 66, in do_train
[rank1]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/administrator/axolotl/src/axolotl/train.py", line 88, in train
[rank1]:     model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
[rank1]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/administrator/axolotl/src/axolotl/utils/models.py", line 822, in load_model
[rank1]:     model.to(f"cuda:{cfg.local_rank}")
[rank1]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1173, in to
[rank1]:     return self._apply(convert)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
[rank1]:     module._apply(fn)
[rank1]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
[rank1]:     module._apply(fn)
[rank1]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
[rank1]:     module._apply(fn)
[rank1]:   [Previous line repeated 5 more times]
[rank1]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 804, in _apply
[rank1]:     param_applied = fn(param)
[rank1]:                     ^^^^^^^^^
[rank1]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1159, in convert
[rank1]:     return t.to(
[rank1]:            ^^^^^
[rank1]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 330, in to
[rank1]:     super().to(device=device, dtype=dtype, non_blocking=non_blocking),
[rank1]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU  has a total capacity of 23.68 GiB of which 60.31 MiB is free. Including non-PyTorch memory, this process has 23.61 GiB memory in use. Of the allocated memory 23.15 GiB is allocated by PyTorch, and 64.54 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/home/administrator/axolotl/src/axolotl/cli/train.py", line 70, in <module>
[rank0]:     fire.Fire(do_cli)
[rank0]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/administrator/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
[rank0]:     return do_train(parsed_cfg, parsed_cli_args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/administrator/axolotl/src/axolotl/cli/train.py", line 66, in do_train
[rank0]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/administrator/axolotl/src/axolotl/train.py", line 88, in train
[rank0]:     model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/administrator/axolotl/src/axolotl/utils/models.py", line 822, in load_model
[rank0]:     model.to(f"cuda:{cfg.local_rank}")
[rank0]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1173, in to
[rank0]:     return self._apply(convert)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
[rank0]:     module._apply(fn)
[rank0]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
[rank0]:     module._apply(fn)
[rank0]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 779, in _apply
[rank0]:     module._apply(fn)
[rank0]:   [Previous line repeated 5 more times]
[rank0]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 804, in _apply
[rank0]:     param_applied = fn(param)
[rank0]:                     ^^^^^^^^^
[rank0]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1159, in convert
[rank0]:     return t.to(
[rank0]:            ^^^^^
[rank0]:   File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 330, in to
[rank0]:     super().to(device=device, dtype=dtype, non_blocking=non_blocking),
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 
W0626 17:07:01.588000 136552701535296 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 4034 closing signal SIGTERM
W0626 17:07:01.589000 136552701535296 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 4037 closing signal SIGTERM
E0626 17:07:01.904000 136552701535296 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 4035) of binary: /home/administrator/miniconda3/envs/axolotl/bin/python
Traceback (most recent call last):
  File "/home/administrator/miniconda3/envs/axolotl/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/administrator/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-26_17:07:01
  host      : Super-Server
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 4036)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-26_17:07:01
  host      : Super-Server
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 4035)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Steps to reproduce

Try to train a 70B model with FSDP or DeepSpeed ZeRO-3.
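For example, with the config below saved as qlora-sft.yml (the same command as in the log above):

accelerate launch -m axolotl.cli.train qlora-sft.yml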

Config yaml

base_model: /home/administrator/models/Meta-Llama-3-70B-Instruct-abliterated-v3.5
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
  
train_on_inputs: false
group_by_length: false
load_in_8bit: false
load_in_4bit: true
strict: false
sequence_len: 512
bf16: auto
fp16:
tf32: false
flash_attention: true

# Data
datasets:
  - path: /home/administrator/datasets/no-robots-sharegpt-fixed.jsonl
    type: sharegpt
    conversation: llama-3
 
warmup_steps: 10
dataset_prepared_path: ./last_run_prepared

# Iterations
num_epochs: 1
saves_per_epoch: 1

# Evaluation
val_set_size: 0.1
eval_table_size:
eval_table_max_new_tokens:
eval_sample_packing: false
evals_per_epoch: 1

# LoRA
output_dir: ./qlora-out
adapter: qlora
lora_model_dir:
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
save_safetensors: true

# Sampling
sample_packing: false
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 4
micro_batch_size: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true

# wandb
wandb_mode: disabled # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
wandb_project: llama-3
wandb_entity: # A wandb Team name if using a Team
wandb_watch:
wandb_name: 
wandb_run_id: # Set the ID of your wandb run
wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training

# Optimizer
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002

# Misc
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
debug:
deepspeed:

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD

weight_decay: 0.1
special_tokens:
  eos_token: "<|eot_id|>"
  pad_token: "<|end_of_text|>"

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main/4d6490b

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
[Screenshot 2024-07-11 at 4:10:58 PM] This is on 4x48GB, and while it settles out at around 10GB/GPU utilization during training, for some reason it has a large memory requirement just to load the model. I'll keep digging into this.

[Screenshot 2024-07-11 at 4:10:58 PM] This is on 4x48GB, and while it settles out at around 10GB/GPU utilization during training, for some reason it has a large memory requirement just to load the model. I'll keep digging into this.

Thanks for looking into this, winglian. If you could get this to work it would speed up 70B training so much for us 24GB GPU users, haha.

Also, does DeepSpeed ZeRO-3 just not shard the model if it's loaded in 4-bit? Is that just how it is? I remember a thread asking about this but don't remember whether the discussion went anywhere.

The priority here is to make it not OOM. Not to make it any faster.

@Nero10578 can you remove the model_type field from your yaml please? Line 2.

The priority here is to make it not OOM. Not to make it any faster.

Yeah, I was saying that because if it OOMs I usually just resort to running it without accelerate, which is dog slow.

@Nero10578 can you remove the model_type field from your yaml please? Line 2.

Ok I will try.

@Nero10578 actually, #1742 should fix this for you.

@Nero10578 actually, #1742 should fix this for you.

Okay awesome will give it a shot thanks! I haven't had time to yet.

What about QLoRA 4-bit with ZeRO-3, does that also work now? Right now ZeRO-3 only seems to shard the model when doing full training or LoRA.

I'll tackle zero3 next week 👍

@Nero10578 maybe set ACCELERATE_DEEPSPEED_ZERO3_INIT=true in your environment?
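For example, setting it just for the launch (same command pattern as above; substitute whichever yml you are training with):

ACCELERATE_DEEPSPEED_ZERO3_INIT=true accelerate launch -m axolotl.cli.train qlora-sft.yml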

@Nero10578 can you remove the model_type field from your yaml please? Line 2.

I don't think doing this made a difference in memory use.

@Nero10578 maybe set ACCELERATE_DEEPSPEED_ZERO3_INIT=true in your environment?

This also didn't make a difference.

Currently I only have access to a 2x24GB RTX 3090Ti machine since I was only testing the 4x3090 machine for a limited time.

So on my 2x3090 Ti machine, when training with DeepSpeed, I can see both RTX 3090 Tis get loaded with ~6GB while the model loads when training QLoRA with load_in_4bit: true, and with ~11GB when loading a Llama 3.1 8B model for LoRA training. Doesn't that look like the model is just being loaded fully onto both GPUs?
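(For scale, using rough numbers of my own: an 8B model is about 16 GB of weights in bf16 and roughly 5 GB in 4-bit, so a fully replicated copy vs. a 2-way ZeRO-3 shard would be on the order of 16 GB vs 8 GB per GPU for bf16 LoRA, and about 5 GB vs 2.5 GB for 4-bit QLoRA, before activations, cache, and optimizer state.)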

[Screenshot: GPU-Z capture]

I am on Windows WSL2 Ubuntu, but I don't think that should be a problem, right? NCCL tests work and transfers between cards are great at 44GB/s since I am using NVLink. WSL2 does prevent me from using nvtop to show GPU memory use, though, so I am showing GPU-Z for this.
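For reference, the NCCL check I mean is NVIDIA's nccl-tests; a typical 2-GPU invocation (assuming the repo is cloned and built locally) is something like:

./build/all_reduce_perf -b 8 -e 1G -f 2 -g 2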

This limits 2x24GB training of Llama 8B models to a 4096 sequence length with LoRA and 8192 with QLoRA when using DeepSpeed.

I am not sure where I am going wrong with my config for 2x24GB Llama 3.1 8B 8192-ctx LoRA using DeepSpeed:

base_model: /home/owen/models/Meta-Llama-3.1-8B-Instruct
# model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
  
train_on_inputs: false
group_by_length: false
load_in_8bit: false
load_in_4bit: false
strict: false
sequence_len: 8192
bf16: auto
fp16: 
tf32: false
flash_attention: true

shuffle_merged_datasets: true

# Data
datasets:
  - path: /home/owen/datasets/train.jsonl
    type: sharegpt
    conversation: llama-3
  
warmup_steps: 10
dataset_prepared_path: ./last_run_prepared

# Iterations
num_epochs: 1
saves_per_epoch: 4

# Evaluation
val_set_size: 0.0025
eval_table_size:
eval_table_max_new_tokens: 128
eval_sample_packing: false
evals_per_epoch: 8

# LoRA
output_dir: ./lora-out
adapter: lora
lora_model_dir:
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
save_safetensors: true

# Sampling
sample_packing: true
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 32
micro_batch_size: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true

# wandb
wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
wandb_project: formax
wandb_entity: # A wandb Team name if using a Team
wandb_watch:
wandb_name: 8-16-8192-v0.1
wandb_run_id: # Set the ID of your wandb run
wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training

# Optimizer
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00001

# Misc
early_stopping_patience:
auto_resume_from_checkpoints: true
logging_steps: 1
debug:
weight_decay: 0.1
special_tokens:
  eos_token: "<|eot_id|>"
  pad_token: "<|end_of_text|>"

# Multi-GPU
deepspeed: /home/owen/axolotl/deepspeed_configs/zero3_bf16.json
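(Side note: the log below warns that processing datasets during training can lead to VRAM instability; axolotl can pre-process the dataset ahead of time, e.g. with the same yml:)

python -m axolotl.cli.preprocess lora-sft-deepspeed.yml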

2x24GB Llama 3.1 8B 8192ctx LoRA DeepSpeed console output:

ACCELERATE_DEEPSPEED_ZERO3_INIT=true accelerate launch -m axolotl.cli.train lora-sft-deepspeed.yml
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `2`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2024-07-26 04:38:51,377] [INFO] [datasets.<module>:58] [PID:2411323] PyTorch version 2.3.1 available.
[2024-07-26 04:38:51,478] [INFO] [datasets.<module>:58] [PID:2411322] PyTorch version 2.3.1 available.
[2024-07-26 04:38:52,528] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-26 04:38:52,595] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-26 04:38:52,595] [INFO] [root.spawn:38] [PID:2411323] gcc -pthread -B /home/owen/miniconda3/envs/axolotl/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/owen/miniconda3/envs/axolotl/include -fPIC -O2 -isystem /home/owen/miniconda3/envs/axolotl/include -fPIC -c /tmp/tmpoe2wwdcr/test.c -o /tmp/tmpoe2wwdcr/test.o
[2024-07-26 04:38:52,609] [INFO] [root.spawn:38] [PID:2411323] gcc -pthread -B /home/owen/miniconda3/envs/axolotl/compiler_compat /tmp/tmpoe2wwdcr/test.o -laio -o /tmp/tmpoe2wwdcr/a.out
[2024-07-26 04:38:52,660] [INFO] [root.spawn:38] [PID:2411322] gcc -pthread -B /home/owen/miniconda3/envs/axolotl/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/owen/miniconda3/envs/axolotl/include -fPIC -O2 -isystem /home/owen/miniconda3/envs/axolotl/include -fPIC -c /tmp/tmpmrvskbmy/test.c -o /tmp/tmpmrvskbmy/test.o
[2024-07-26 04:38:52,677] [INFO] [root.spawn:38] [PID:2411322] gcc -pthread -B /home/owen/miniconda3/envs/axolotl/compiler_compat /tmp/tmpmrvskbmy/test.o -laio -o /tmp/tmpmrvskbmy/a.out
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-26 04:38:54,461] [INFO] [axolotl.utils.config.models.input.check_eval_packing:961] [PID:2411323] [RANK:1] setting `remove_unused_columns: false` for when sample_packing and eval_sample_packing don't match
[2024-07-26 04:38:54,462] [WARNING] [axolotl.utils.config.models.input.hint_lora_8bit:1047] [PID:2411323] [RANK:1] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-07-26 04:38:54,462] [DEBUG] [axolotl.normalize_config:80] [PID:2411323] [RANK:1] bf16 support detected, enabling for this configuration.
[2024-07-26 04:38:54,464] [INFO] [axolotl.normalize_config:183] [PID:2411323] [RANK:1] GPU memory usage baseline: 0.000GB (+0.499GB misc)
[2024-07-26 04:38:54,467] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-26 04:38:54,574] [INFO] [axolotl.utils.config.models.input.check_eval_packing:961] [PID:2411322] [RANK:0] setting `remove_unused_columns: false` for when sample_packing and eval_sample_packing don't match
[2024-07-26 04:38:54,575] [WARNING] [axolotl.utils.config.models.input.hint_lora_8bit:1047] [PID:2411322] [RANK:0] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-07-26 04:38:54,575] [DEBUG] [axolotl.normalize_config:80] [PID:2411322] [RANK:0] bf16 support detected, enabling for this configuration.
[2024-07-26 04:38:54,578] [INFO] [axolotl.normalize_config:183] [PID:2411322] [RANK:0] GPU memory usage baseline: 0.000GB (+0.499GB misc)
[2024-07-26 04:38:54,580] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-26 04:38:54,580] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
                                 dP            dP   dP
                                 88            88   88
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP



****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.32.0
        peft: 0.11.1
transformers: 4.43.1
         trl: 0.9.6
       torch: 2.3.1
bitsandbytes: 0.43.1
****************************************
[2024-07-26 04:38:54,684] [WARNING] [axolotl.scripts.check_user_token:487] [PID:2411322] [RANK:0] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2024-07-26 04:38:54,830] [WARNING] [axolotl.scripts.check_user_token:487] [PID:2411323] [RANK:1] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2024-07-26 04:38:55,026] [DEBUG] [axolotl.load_tokenizer:280] [PID:2411322] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-07-26 04:38:55,026] [DEBUG] [axolotl.load_tokenizer:281] [PID:2411322] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-07-26 04:38:55,026] [DEBUG] [axolotl.load_tokenizer:282] [PID:2411322] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-07-26 04:38:55,026] [DEBUG] [axolotl.load_tokenizer:283] [PID:2411322] [RANK:0] UNK: None / None
[2024-07-26 04:38:55,026] [INFO] [axolotl.load_tokenizer:294] [PID:2411322] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 04:38:55,026] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:2411322] [RANK:0] Unable to find prepared dataset in last_run_prepared/d2b6bc4a02f25a82fc1d087d95a9dd9d
[2024-07-26 04:38:55,026] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:2411322] [RANK:0] Loading raw datasets...
[2024-07-26 04:38:55,026] [WARNING] [axolotl.load_tokenized_prepared_datasets:186] [PID:2411322] [RANK:0] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-07-26 04:38:55,026] [INFO] [axolotl.load_tokenized_prepared_datasets:193] [PID:2411322] [RANK:0] No seed provided, using default seed of 42
[2024-07-26 04:38:55,262] [DEBUG] [axolotl.load_tokenizer:280] [PID:2411323] [RANK:1] EOS: 128009 / <|eot_id|>
[2024-07-26 04:38:55,262] [DEBUG] [axolotl.load_tokenizer:281] [PID:2411323] [RANK:1] BOS: 128000 / <|begin_of_text|>
[2024-07-26 04:38:55,262] [DEBUG] [axolotl.load_tokenizer:282] [PID:2411323] [RANK:1] PAD: 128001 / <|end_of_text|>
[2024-07-26 04:38:55,262] [DEBUG] [axolotl.load_tokenizer:283] [PID:2411323] [RANK:1] UNK: None / None
[2024-07-26 04:38:55,262] [INFO] [axolotl.load_tokenizer:294] [PID:2411323] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 04:38:56,519] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411322] [RANK:0] Loading dataset with base_type: sharegpt and prompt_style: None
[2024-07-26 04:38:57,759] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411322] [RANK:0] Loading dataset with base_type: sharegpt and prompt_style: None
[2024-07-26 04:38:59,009] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411322] [RANK:0] Loading dataset with base_type: sharegpt and prompt_style: None
[2024-07-26 04:39:00,379] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411322] [RANK:0] Loading dataset with base_type: sharegpt and prompt_style: None
[2024-07-26 04:39:01,871] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411322] [RANK:0] Loading dataset with base_type: sharegpt and prompt_style: None
[2024-07-26 04:39:02,146] [INFO] [axolotl.load_tokenized_prepared_datasets:414] [PID:2411322] [RANK:0] merging datasets
[2024-07-26 04:39:02,192] [DEBUG] [axolotl.load_tokenized_prepared_datasets:419] [PID:2411322] [RANK:0] shuffle merged datasets
[2024-07-26 04:39:03,798] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:2411323] [RANK:1] Unable to find prepared dataset in last_run_prepared/d2b6bc4a02f25a82fc1d087d95a9dd9d
[2024-07-26 04:39:03,798] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:2411323] [RANK:1] Loading raw datasets...
[2024-07-26 04:39:03,798] [WARNING] [axolotl.load_tokenized_prepared_datasets:186] [PID:2411323] [RANK:1] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-07-26 04:39:03,798] [INFO] [axolotl.load_tokenized_prepared_datasets:193] [PID:2411323] [RANK:1] No seed provided, using default seed of 42
[2024-07-26 04:39:03,799] [INFO] [axolotl.load_tokenized_prepared_datasets:427] [PID:2411322] [RANK:0] Saving merged prepared dataset to disk... last_run_prepared/d2b6bc4a02f25a82fc1d087d95a9dd9d
Saving the dataset (1/9 shards):  12%|███▉                             | 62657/527905 [00:00<00:05, 80409.85 examples/s]
[2024-07-26 04:39:04,759] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411323] [RANK:1] Loading dataset with base_type: sharegpt and prompt_style: None
Saving the dataset (2/9 shards):  33%|██████████▌                     | 174313/527905 [00:02<00:04, 80947.34 examples/s]
[2024-07-26 04:39:05,967] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411323] [RANK:1] Loading dataset with base_type: sharegpt and prompt_style: None
Saving the dataset (5/9 shards):  56%|█████████████████▊              | 293281/527905 [00:03<00:02, 82660.60 examples/s]
[2024-07-26 04:39:07,354] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411323] [RANK:1] Loading dataset with base_type: sharegpt and prompt_style: None
Saving the dataset (7/9 shards):  78%|████████████████████████▉       | 410593/527905 [00:04<00:01, 89631.61 examples/s]
[2024-07-26 04:39:08,765] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411323] [RANK:1] Loading dataset with base_type: sharegpt and prompt_style: None
Saving the dataset (8/9 shards):  98%|███████████████████████████████▍| 518249/527905 [00:06<00:00, 84044.92 examples/s]
[2024-07-26 04:39:10,013] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411323] [RANK:1] Loading dataset with base_type: sharegpt and prompt_style: None
Saving the dataset (9/9 shards): 100%|████████████████████████████████| 527905/527905 [00:06<00:00, 84500.14 examples/s]
[2024-07-26 04:39:10,242] [INFO] [axolotl.load_tokenized_prepared_datasets:414] [PID:2411323] [RANK:1] merging datasets
[2024-07-26 04:39:10,282] [DEBUG] [axolotl.load_tokenized_prepared_datasets:419] [PID:2411323] [RANK:1] shuffle merged datasets
[2024-07-26 04:39:10,965] [DEBUG] [axolotl.calculate_total_num_steps:297] [PID:2411322] [RANK:0] total_num_tokens: 193_555_279
[2024-07-26 04:39:16,440] [DEBUG] [axolotl.calculate_total_num_steps:310] [PID:2411322] [RANK:0] `total_supervised_tokens: 73_843_143`
[2024-07-26 04:39:22,449] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411322] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 96777639
[2024-07-26 04:39:22,449] [DEBUG] [axolotl.calculate_total_num_steps:362] [PID:2411322] [RANK:0] data_loader_len: 365
[2024-07-26 04:39:23,150] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411323] [RANK:1] packing_efficiency_estimate: 1.0 total_num_tokens per device: 96777639
[2024-07-26 04:39:23,238] [INFO] [axolotl.calc_sample_packing_eff_est:368] [PID:2411322] [RANK:0] sample_packing_eff_est across ranks: [0.949004054069519, 0.9498433470726013]
[2024-07-26 04:39:23,240] [DEBUG] [axolotl.calculate_total_num_steps:380] [PID:2411322] [RANK:0] sample_packing_eff_est: 0.95
[2024-07-26 04:39:23,240] [DEBUG] [axolotl.calculate_total_num_steps:388] [PID:2411322] [RANK:0] total_num_steps: 365
[2024-07-26 04:39:23,267] [DEBUG] [axolotl.train.train:66] [PID:2411322] [RANK:0] loading tokenizer... /home/owen/models/Meta-Llama-3.1-8B-Instruct
[2024-07-26 04:39:23,615] [DEBUG] [axolotl.load_tokenizer:280] [PID:2411323] [RANK:1] EOS: 128009 / <|eot_id|>
[2024-07-26 04:39:23,615] [DEBUG] [axolotl.load_tokenizer:281] [PID:2411323] [RANK:1] BOS: 128000 / <|begin_of_text|>
[2024-07-26 04:39:23,615] [DEBUG] [axolotl.load_tokenizer:282] [PID:2411323] [RANK:1] PAD: 128001 / <|end_of_text|>
[2024-07-26 04:39:23,615] [DEBUG] [axolotl.load_tokenizer:283] [PID:2411323] [RANK:1] UNK: None / None
[2024-07-26 04:39:23,615] [INFO] [axolotl.load_tokenizer:294] [PID:2411323] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 04:39:23,615] [DEBUG] [axolotl.load_tokenizer:280] [PID:2411322] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-07-26 04:39:23,616] [DEBUG] [axolotl.load_tokenizer:281] [PID:2411322] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-07-26 04:39:23,616] [DEBUG] [axolotl.load_tokenizer:282] [PID:2411322] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-07-26 04:39:23,616] [DEBUG] [axolotl.load_tokenizer:283] [PID:2411322] [RANK:0] UNK: None / None
[2024-07-26 04:39:23,616] [INFO] [axolotl.load_tokenizer:294] [PID:2411322] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 04:39:23,616] [DEBUG] [axolotl.train.train:95] [PID:2411322] [RANK:0] loading model and peft_config...
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-07-26 04:39:26,262] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 291, num_elems = 8.03B
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.09it/s]
[2024-07-26 04:39:27,624] [INFO] [axolotl.load_model:764] [PID:2411323] [RANK:1] GPU memory usage after model load: 7.481GB (+12.504GB cache)
[2024-07-26 04:39:27,629] [INFO] [axolotl.load_model:824] [PID:2411323] [RANK:1] converting modules to torch.bfloat16 for flash attention
[2024-07-26 04:39:27,633] [INFO] [axolotl.load_lora:986] [PID:2411323] [RANK:1] found linear modules: ['o_proj', 'gate_proj', 'up_proj', 'v_proj', 'q_proj', 'k_proj', 'down_proj']
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.02s/it]
[2024-07-26 04:39:30,420] [INFO] [axolotl.load_model:764] [PID:2411322] [RANK:0] GPU memory usage after model load: 7.481GB (+2.125GB cache, +1.031GB misc)
[2024-07-26 04:39:30,425] [INFO] [axolotl.load_model:824] [PID:2411322] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-07-26 04:39:30,428] [INFO] [axolotl.load_lora:986] [PID:2411322] [RANK:0] found linear modules: ['v_proj', 'k_proj', 'gate_proj', 'up_proj', 'down_proj', 'q_proj', 'o_proj']
[2024-07-26 04:39:30,686] [INFO] [axolotl.load_model:869] [PID:2411323] [RANK:1] GPU memory usage after adapters: 7.520GB (+12.504GB cache)
trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605
[2024-07-26 04:39:30,909] [INFO] [axolotl.load_model:869] [PID:2411322] [RANK:0] GPU memory usage after adapters: 7.520GB (+2.125GB cache, +1.031GB misc)
/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-07-26 04:39:31,600] [INFO] [axolotl.train.train:136] [PID:2411322] [RANK:0] Pre-saving adapter config to ./lora-out
[2024-07-26 04:39:31,762] [INFO] [axolotl.train.train:173] [PID:2411322] [RANK:0] Starting trainer...
[2024-07-26 04:39:32,197] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411323] [RANK:1] packing_efficiency_estimate: 0.95 total_num_tokens per device: 96777639
[2024-07-26 04:39:32,310] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411322] [RANK:0] packing_efficiency_estimate: 0.95 total_num_tokens per device: 96777639
[2024-07-26 04:39:32,524] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411323] [RANK:1] packing_efficiency_estimate: 0.95 total_num_tokens per device: 96777639
[2024-07-26 04:39:32,685] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411322] [RANK:0] packing_efficiency_estimate: 0.95 total_num_tokens per device: 96777639
[2024-07-26 04:39:33,069] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411323] [RANK:1] packing_efficiency_estimate: 0.95 total_num_tokens per device: 96777639
[2024-07-26 04:39:33,242] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411322] [RANK:0] packing_efficiency_estimate: 0.95 total_num_tokens per device: 96777639
[2024-07-26 04:39:33,401] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411323] [RANK:1] packing_efficiency_estimate: 0.95 total_num_tokens per device: 96777639
[2024-07-26 04:39:33,569] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411322] [RANK:0] packing_efficiency_estimate: 0.95 total_num_tokens per device: 96777639
[2024-07-26 04:39:33,688] [WARNING] [engine.py:1188:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
Parameter Offload: Total persistent parameters: 10227712 in 417 params
[2024-07-26 04:39:36,818] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411323] [RANK:1] packing_efficiency_estimate: 0.95 total_num_tokens per device: 96777639
[2024-07-26 04:39:37,147] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411323] [RANK:1] packing_efficiency_estimate: 0.95 total_num_tokens per device: 96777639
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
wandb: Currently logged in as: owenarliawan. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.5
wandb: Run data is saved locally in /home/owen/train-Llama-3.1-8B-Formax-v0.1/wandb/run-20240726_043938-eytn2p12
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 8-16-8192-v0.1
wandb: ⭐️ View project at https://wandb.ai/owenarliawan/formax
wandb: 🚀 View run at https://wandb.ai/owenarliawan/formax/runs/eytn2p12
wandb: WARNING Saving files without folders. If you want to preserve subdirectories pass base_path to wandb.save, i.e. wandb.save("/mnt/folder/file.h5", base_path="/mnt")
[2024-07-26 04:39:42,488] [INFO] [axolotl.callbacks.on_train_begin:785] [PID:2411322] [RANK:0] The Axolotl config has been saved to the WandB run under files.
  0%|                                                                                           | 0/384 [00:00<?, ?it/s][2024-07-26 04:39:42,866] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411322] [RANK:0] packing_efficiency_estimate: 0.95 total_num_tokens per device: 96777639
[2024-07-26 04:39:43,188] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:2411322] [RANK:0] packing_efficiency_estimate: 0.95 total_num_tokens per device: 96777639
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/owen/axolotl/src/axolotl/cli/train.py", line 72, in <module>
    fire.Fire(do_cli)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/axolotl/src/axolotl/cli/train.py", line 67, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/axolotl/src/axolotl/train.py", line 187, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 3349, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2143, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/home/owen/axolotl/src/axolotl/cli/train.py", line 72, in <module>
[rank0]:     fire.Fire(do_cli)
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
[rank0]:     return do_train(parsed_cfg, parsed_cli_args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/axolotl/src/axolotl/cli/train.py", line 67, in do_train
[rank0]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/axolotl/src/axolotl/train.py", line 187, in train
[rank0]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 3349, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2143, in backward
[rank0]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
[rank0]:     self.engine.backward(loss, **kwargs)
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
[rank0]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
[rank0]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank0]:     scaled_loss.backward(retain_graph=retain_graph)
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/_tensor.py", line 525, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU
wandb: 🚀 View run 8-16-8192-v0.1 at: https://wandb.ai/owenarliawan/formax/runs/eytn2p12
wandb: ⭐️ View project at: https://wandb.ai/owenarliawan/formax
wandb: Synced 6 W&B file(s), 0 media file(s), 1 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20240726_043938-eytn2p12/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
W0726 04:40:18.105000 139726073537600 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2411323 closing signal SIGTERM
E0726 04:40:18.722000 139726073537600 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2411322) of binary: /home/owen/miniconda3/envs/axolotl/bin/python
Traceback (most recent call last):
  File "/home/owen/miniconda3/envs/axolotl/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
    multi_gpu_launcher(args)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-26_04:40:18
  host      : COMPUTE-PC.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2411322)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

With FSDP it just OOMs at the very beginning no matter what, and I don't even see the GPU VRAM maxed out before it fails. I tried the lowest-VRAM settings (512 sequence length with QLoRA) and it still OOMs.

[Screenshot: Capture2 (GPU VRAM usage)]

2x24GB Llama 3.1 8B QLORA FSDP Config:

base_model: /home/owen/models/Meta-Llama-3.1-8B-Instruct
# model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
  
train_on_inputs: false
group_by_length: false
load_in_8bit: false
load_in_4bit: true
strict: false
sequence_len: 512
bf16: auto
fp16: 
tf32: false
flash_attention: true

shuffle_merged_datasets: true

# Data
datasets:
  - path: /home/owen/datasets/train.jsonl
    type: sharegpt
    conversation: llama-3
 
warmup_steps: 10
dataset_prepared_path: ./last_run_prepared

# Iterations
num_epochs: 1
saves_per_epoch: 4

# Evaluation
val_set_size: 0.0025
eval_table_size:
eval_table_max_new_tokens: 128
eval_sample_packing: false
evals_per_epoch: 8

# LoRA
output_dir: ./lora-out
adapter: qlora
lora_model_dir:
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
save_safetensors: true

# Sampling
sample_packing: false
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 32
micro_batch_size: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true

# wandb
wandb_mode: # "offline" to save run metadata locally and not sync to the server, "disabled" to turn off wandb
wandb_project: formax
wandb_entity: # A wandb Team name if using a Team
wandb_watch:
wandb_name: 8-16-8192-v0.1
wandb_run_id: # Set the ID of your wandb run
wandb_log_model: # "checkpoint" to log model to wandb Artifacts every `save_steps` or "end" to log only at the end of training

# Optimizer
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00001

# Misc
early_stopping_patience:
auto_resume_from_checkpoints: true
logging_steps: 1
debug:
weight_decay: 0.1
special_tokens:
  eos_token: "<|eot_id|>"
  pad_token: "<|end_of_text|>"

# Multi-GPU
deepspeed:
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
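
Note: the transformers warning in the traceback below says that with FSDP full shard, checkpointing should be requested via `activation_checkpointing` in `fsdp_config` instead of `gradient_checkpointing`, since the latter adds a redundant AllGather in the backward pass. A minimal sketch of that change, spelling the key exactly as the warning does (whether axolotl forwards it under this name is an assumption):

# sketch only: move checkpointing from TrainingArguments to the FSDP config
gradient_checkpointing: false
fsdp_config:
  activation_checkpointing: true  # key name taken from the transformers warning; assumed to be passed through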

2x24GB Llama 3.1 8B QLORA FSDP Traceback:

accelerate launch -m axolotl.cli.train qlora-sft-fsdp.yml
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `2`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2024-07-26 04:35:07,198] [INFO] [datasets.<module>:58] [PID:2026383] PyTorch version 2.3.1 available.
[2024-07-26 04:35:07,262] [INFO] [datasets.<module>:58] [PID:2026385] PyTorch version 2.3.1 available.
[2024-07-26 04:35:08,342] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-26 04:35:08,379] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-26 04:35:08,407] [INFO] [root.spawn:38] [PID:2026383] gcc -pthread -B /home/owen/miniconda3/envs/axolotl/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/owen/miniconda3/envs/axolotl/include -fPIC -O2 -isystem /home/owen/miniconda3/envs/axolotl/include -fPIC -c /tmp/tmpl5z1xjbv/test.c -o /tmp/tmpl5z1xjbv/test.o
[2024-07-26 04:35:08,424] [INFO] [root.spawn:38] [PID:2026383] gcc -pthread -B /home/owen/miniconda3/envs/axolotl/compiler_compat /tmp/tmpl5z1xjbv/test.o -laio -o /tmp/tmpl5z1xjbv/a.out
[2024-07-26 04:35:08,445] [INFO] [root.spawn:38] [PID:2026385] gcc -pthread -B /home/owen/miniconda3/envs/axolotl/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/owen/miniconda3/envs/axolotl/include -fPIC -O2 -isystem /home/owen/miniconda3/envs/axolotl/include -fPIC -c /tmp/tmpbxfe5106/test.c -o /tmp/tmpbxfe5106/test.o
[2024-07-26 04:35:08,460] [INFO] [root.spawn:38] [PID:2026385] gcc -pthread -B /home/owen/miniconda3/envs/axolotl/compiler_compat /tmp/tmpbxfe5106/test.o -laio -o /tmp/tmpbxfe5106/a.out
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-26 04:35:10,398] [DEBUG] [axolotl.normalize_config:80] [PID:2026385] [RANK:1] bf16 support detected, enabling for this configuration.
[2024-07-26 04:35:10,400] [INFO] [axolotl.normalize_config:183] [PID:2026385] [RANK:1] GPU memory usage baseline: 0.000GB (+0.497GB misc)
[2024-07-26 04:35:10,474] [DEBUG] [axolotl.normalize_config:80] [PID:2026383] [RANK:0] bf16 support detected, enabling for this configuration.
[2024-07-26 04:35:10,476] [INFO] [axolotl.normalize_config:183] [PID:2026383] [RANK:0] GPU memory usage baseline: 0.000GB (+0.497GB misc)
                                 dP            dP   dP
                                 88            88   88
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP



****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.32.0
        peft: 0.11.1
transformers: 4.43.1
         trl: 0.9.6
       torch: 2.3.1
bitsandbytes: 0.43.1
****************************************
[2024-07-26 04:35:10,536] [WARNING] [axolotl.scripts.check_user_token:487] [PID:2026383] [RANK:0] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2024-07-26 04:35:10,766] [WARNING] [axolotl.scripts.check_user_token:487] [PID:2026385] [RANK:1] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2024-07-26 04:35:10,861] [DEBUG] [axolotl.load_tokenizer:280] [PID:2026383] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-07-26 04:35:10,861] [DEBUG] [axolotl.load_tokenizer:281] [PID:2026383] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-07-26 04:35:10,861] [DEBUG] [axolotl.load_tokenizer:282] [PID:2026383] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-07-26 04:35:10,861] [DEBUG] [axolotl.load_tokenizer:283] [PID:2026383] [RANK:0] UNK: None / None
[2024-07-26 04:35:10,861] [INFO] [axolotl.load_tokenizer:294] [PID:2026383] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 04:35:10,862] [INFO] [axolotl.load_tokenized_prepared_datasets:179] [PID:2026383] [RANK:0] Loading prepared dataset from disk at last_run_prepared/61739362705bb417c0a56287c5686aba...
[2024-07-26 04:35:10,880] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:2026383] [RANK:0] Prepared dataset loaded from disk...
[2024-07-26 04:35:11,086] [DEBUG] [axolotl.load_tokenizer:280] [PID:2026385] [RANK:1] EOS: 128009 / <|eot_id|>
[2024-07-26 04:35:11,086] [DEBUG] [axolotl.load_tokenizer:281] [PID:2026385] [RANK:1] BOS: 128000 / <|begin_of_text|>
[2024-07-26 04:35:11,087] [DEBUG] [axolotl.load_tokenizer:282] [PID:2026385] [RANK:1] PAD: 128001 / <|end_of_text|>
[2024-07-26 04:35:11,087] [DEBUG] [axolotl.load_tokenizer:283] [PID:2026385] [RANK:1] UNK: None / None
[2024-07-26 04:35:11,087] [INFO] [axolotl.load_tokenizer:294] [PID:2026385] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 04:35:11,575] [INFO] [axolotl.load_tokenized_prepared_datasets:179] [PID:2026385] [RANK:1] Loading prepared dataset from disk at last_run_prepared/61739362705bb417c0a56287c5686aba...
[2024-07-26 04:35:11,590] [INFO] [axolotl.load_tokenized_prepared_datasets:181] [PID:2026385] [RANK:1] Prepared dataset loaded from disk...
[2024-07-26 04:35:12,079] [DEBUG] [axolotl.calculate_total_num_steps:297] [PID:2026383] [RANK:0] total_num_tokens: 100_647_449
[2024-07-26 04:35:16,162] [DEBUG] [axolotl.calculate_total_num_steps:310] [PID:2026383] [RANK:0] `total_supervised_tokens: 42_586_235`
[2024-07-26 04:35:16,162] [DEBUG] [axolotl.calculate_total_num_steps:388] [PID:2026383] [RANK:0] total_num_steps: 6662
[2024-07-26 04:35:16,185] [DEBUG] [axolotl.train.train:66] [PID:2026383] [RANK:0] loading tokenizer... /home/owen/models/Meta-Llama-3.1-8B-Instruct
[2024-07-26 04:35:16,474] [DEBUG] [axolotl.load_tokenizer:280] [PID:2026383] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-07-26 04:35:16,474] [DEBUG] [axolotl.load_tokenizer:281] [PID:2026383] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-07-26 04:35:16,474] [DEBUG] [axolotl.load_tokenizer:282] [PID:2026383] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-07-26 04:35:16,474] [DEBUG] [axolotl.load_tokenizer:283] [PID:2026383] [RANK:0] UNK: None / None
[2024-07-26 04:35:16,474] [INFO] [axolotl.load_tokenizer:294] [PID:2026383] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 04:35:16,475] [DEBUG] [axolotl.train.train:95] [PID:2026383] [RANK:0] loading model and peft_config...
[2024-07-26 04:35:16,551] [DEBUG] [axolotl.load_tokenizer:280] [PID:2026385] [RANK:1] EOS: 128009 / <|eot_id|>
[2024-07-26 04:35:16,551] [DEBUG] [axolotl.load_tokenizer:281] [PID:2026385] [RANK:1] BOS: 128000 / <|begin_of_text|>
[2024-07-26 04:35:16,551] [DEBUG] [axolotl.load_tokenizer:282] [PID:2026385] [RANK:1] PAD: 128001 / <|end_of_text|>
[2024-07-26 04:35:16,551] [DEBUG] [axolotl.load_tokenizer:283] [PID:2026385] [RANK:1] UNK: None / None
[2024-07-26 04:35:16,551] [INFO] [axolotl.load_tokenizer:294] [PID:2026385] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 4/4 [00:18<00:00,  4.54s/it]
[2024-07-26 04:35:35,521] [INFO] [axolotl.load_model:764] [PID:2026383] [RANK:0] GPU memory usage after model load: 2.061GB (+0.002GB cache, +0.967GB misc)
[2024-07-26 04:35:35,523] [INFO] [axolotl.load_lora:986] [PID:2026383] [RANK:0] found linear modules: ['k_proj', 'o_proj', 'up_proj', 'v_proj', 'q_proj', 'down_proj', 'gate_proj']
trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605
[2024-07-26 04:35:36,051] [INFO] [axolotl.load_model:869] [PID:2026383] [RANK:0] GPU memory usage after adapters: 1.958GB (+0.105GB cache, +0.967GB misc)
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 4/4 [00:18<00:00,  4.69s/it]
[2024-07-26 04:35:36,215] [INFO] [axolotl.load_model:764] [PID:2026385] [RANK:1] GPU memory usage after model load: 2.061GB (+0.002GB cache, +0.967GB misc)
[2024-07-26 04:35:36,218] [INFO] [axolotl.load_lora:986] [PID:2026385] [RANK:1] found linear modules: ['v_proj', 'k_proj', 'q_proj', 'up_proj', 'gate_proj', 'down_proj', 'o_proj']
/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-07-26 04:35:36,618] [INFO] [axolotl.load_model:869] [PID:2026385] [RANK:1] GPU memory usage after adapters: 2.061GB (+0.002GB cache, +0.908GB misc)
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
[2024-07-26 04:35:36,633] [INFO] [axolotl.train.train:136] [PID:2026383] [RANK:0] Pre-saving adapter config to ./lora-out
[2024-07-26 04:35:36,809] [INFO] [axolotl.train.train:173] [PID:2026383] [RANK:0] Starting trainer...
/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py:1550: UserWarning: Upcasted low precision parameters in Linear because mixed precision turned on in FSDP. Affects: weight.
  warnings.warn(
/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py:1556: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
  warnings.warn(
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[rank1]: Traceback (most recent call last):
[rank1]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank1]:   File "<frozen runpy>", line 88, in _run_code
[rank1]:   File "/home/owen/axolotl/src/axolotl/cli/train.py", line 72, in <module>
[rank1]:     fire.Fire(do_cli)
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
[rank1]:     return do_train(parsed_cfg, parsed_cli_args)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/axolotl/src/axolotl/cli/train.py", line 67, in do_train
[rank1]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/axolotl/src/axolotl/train.py", line 187, in train
[rank1]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2278, in _inner_training_loop
[rank1]:     with self.accelerator.accumulate(model):
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 1069, in accumulate
[rank1]:     cm_stack.enter_context(contextlib.nullcontext() if allow_gradient_sync else self.no_sync(m))
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/contextlib.py", line 517, in enter_context
[rank1]:     result = _enter(cm)
[rank1]:              ^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 950, in no_sync
[rank1]:     with context():
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1048, in no_sync
[rank1]:     _lazy_init(self, self)
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 138, in _lazy_init
[rank1]:     _share_state_and_init_handle_attrs(state, root_module)
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 178, in _share_state_and_init_handle_attrs
[rank1]:     handle.init_flat_param_attributes()
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 1187, in init_flat_param_attributes
[rank1]:     flat_param._local_shard = flat_param._local_shard.pin_memory()
[rank1]:                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: CUDA error: out of memory
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

wandb: Currently logged in as: owenarliawan. Use `wandb login --relogin` to force relogin
W0726 04:35:44.884000 140018420294720 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2026383 closing signal SIGTERM
E0726 04:35:45.250000 140018420294720 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 2026385) of binary: /home/owen/miniconda3/envs/axolotl/bin/python
Traceback (most recent call last):
  File "/home/owen/miniconda3/envs/axolotl/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
    multi_gpu_launcher(args)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-26_04:35:44
  host      : COMPUTE-PC.
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2026385)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@Nero10578 actually, #1742 should fix this for you.

I also tried 70B with FSDP, which you showed working on 2x4090, but I get OOM on that as well.

I am using the example config here: https://github.com/axolotl-ai-cloud/axolotl/blob/main/examples/llama-3/qlora-fsdp-70b.yaml

I am starting to suspect WSL2 is somehow not allowing as efficient use of VRAM. I do have 256GB of RAM, with 220GB allocated to WSL2, but you mentioned it might need 250GB, so maybe I just need to allocate more?
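
If it is the WSL2 memory cap, that limit is set in `%UserProfile%\.wslconfig` on the Windows side and applied with `wsl --shutdown`; a minimal sketch using the 250GB figure mentioned above:

[wsl2]
memory=250GB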

Again I don't even see it max out the VRAM with FSDP:
[Screenshot: Capture3 (GPU VRAM usage)]
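
For reference, per-GPU memory during a run can be watched with something like this (assuming `nvidia-smi` is available inside the WSL2 environment):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1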

Config:

base_model: /home/owen/models/Meta-Llama-3-70B-Instruct-abliterated-v3.5
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer  # PreTrainedTokenizerFast

load_in_8bit: false
load_in_4bit: true
strict: false

# Data
datasets:
  - path: /home/owen/datasets/train.jsonl
    type: sharegpt
    conversation: llama-3
    
dataset_prepared_path: ./last_run_prepared
val_set_size: 0.05
output_dir: ./qlora-llama3-70b

adapter: qlora
lora_model_dir: 

sequence_len: 512
sample_packing: false
pad_to_sequence_len: true

lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 4
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.00001

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
special_tokens:
  pad_token: <|end_of_text|>

Traceback:

accelerate launch -m axolotl.cli.train qlora-sft-fsdp.yml
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `2`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2024-07-26 04:49:22,657] [INFO] [datasets.<module>:58] [PID:2411673] PyTorch version 2.3.1 available.
[2024-07-26 04:49:22,658] [INFO] [datasets.<module>:58] [PID:2411672] PyTorch version 2.3.1 available.
[2024-07-26 04:49:24,624] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-26 04:49:24,626] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-26 04:49:24,732] [INFO] [root.spawn:38] [PID:2411673] gcc -pthread -B /home/owen/miniconda3/envs/axolotl/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/owen/miniconda3/envs/axolotl/include -fPIC -O2 -isystem /home/owen/miniconda3/envs/axolotl/include -fPIC -c /tmp/tmpzj4i8mqc/test.c -o /tmp/tmpzj4i8mqc/test.o
[2024-07-26 04:49:24,732] [INFO] [root.spawn:38] [PID:2411672] gcc -pthread -B /home/owen/miniconda3/envs/axolotl/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/owen/miniconda3/envs/axolotl/include -fPIC -O2 -isystem /home/owen/miniconda3/envs/axolotl/include -fPIC -c /tmp/tmp1agqw6aq/test.c -o /tmp/tmp1agqw6aq/test.o
[2024-07-26 04:49:24,821] [INFO] [root.spawn:38] [PID:2411673] gcc -pthread -B /home/owen/miniconda3/envs/axolotl/compiler_compat /tmp/tmpzj4i8mqc/test.o -laio -o /tmp/tmpzj4i8mqc/a.out
[2024-07-26 04:49:24,821] [INFO] [root.spawn:38] [PID:2411672] gcc -pthread -B /home/owen/miniconda3/envs/axolotl/compiler_compat /tmp/tmp1agqw6aq/test.o -laio -o /tmp/tmp1agqw6aq/a.out
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-26 04:49:28,071] [DEBUG] [axolotl.normalize_config:80] [PID:2411672] [RANK:0] bf16 support detected, enabling for this configuration.
[2024-07-26 04:49:28,073] [DEBUG] [axolotl.normalize_config:80] [PID:2411673] [RANK:1] bf16 support detected, enabling for this configuration.
[2024-07-26 04:49:28,075] [INFO] [axolotl.normalize_config:183] [PID:2411672] [RANK:0] GPU memory usage baseline: 0.000GB (+0.376GB misc)
[2024-07-26 04:49:28,075] [INFO] [axolotl.normalize_config:183] [PID:2411673] [RANK:1] GPU memory usage baseline: 0.000GB (+0.376GB misc)
                                 dP            dP   dP
                                 88            88   88
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP



****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.32.0
        peft: 0.11.1
transformers: 4.43.1
         trl: 0.9.6
       torch: 2.3.1
bitsandbytes: 0.43.1
****************************************
[2024-07-26 04:49:28,131] [WARNING] [axolotl.scripts.check_user_token:487] [PID:2411672] [RANK:0] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2024-07-26 04:49:28,413] [WARNING] [axolotl.scripts.check_user_token:487] [PID:2411673] [RANK:1] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2024-07-26 04:49:28,483] [DEBUG] [axolotl.load_tokenizer:280] [PID:2411672] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-07-26 04:49:28,483] [DEBUG] [axolotl.load_tokenizer:281] [PID:2411672] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-07-26 04:49:28,483] [DEBUG] [axolotl.load_tokenizer:282] [PID:2411672] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-07-26 04:49:28,483] [DEBUG] [axolotl.load_tokenizer:283] [PID:2411672] [RANK:0] UNK: None / None
[2024-07-26 04:49:28,483] [INFO] [axolotl.load_tokenizer:294] [PID:2411672] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 04:49:28,483] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:2411672] [RANK:0] Unable to find prepared dataset in last_run_prepared/07aa2af7881c3ca21e89fc99ca102cd6
[2024-07-26 04:49:28,483] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:2411672] [RANK:0] Loading raw datasets...
[2024-07-26 04:49:28,483] [WARNING] [axolotl.load_tokenized_prepared_datasets:186] [PID:2411672] [RANK:0] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-07-26 04:49:28,483] [INFO] [axolotl.load_tokenized_prepared_datasets:193] [PID:2411672] [RANK:0] No seed provided, using default seed of 42
[2024-07-26 04:49:28,771] [DEBUG] [axolotl.load_tokenizer:280] [PID:2411673] [RANK:1] EOS: 128009 / <|eot_id|>
[2024-07-26 04:49:28,771] [DEBUG] [axolotl.load_tokenizer:281] [PID:2411673] [RANK:1] BOS: 128000 / <|begin_of_text|>
[2024-07-26 04:49:28,771] [DEBUG] [axolotl.load_tokenizer:282] [PID:2411673] [RANK:1] PAD: 128001 / <|end_of_text|>
[2024-07-26 04:49:28,771] [DEBUG] [axolotl.load_tokenizer:283] [PID:2411673] [RANK:1] UNK: None / None
[2024-07-26 04:49:28,771] [INFO] [axolotl.load_tokenizer:294] [PID:2411673] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 04:49:30,066] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411672] [RANK:0] Loading dataset with base_type: sharegpt and prompt_style: None
Tokenizing Prompts (num_proc=40): 100%|██████████████████████████████████| 54376/54376 [00:11<00:00, 4786.30 examples/s]
[2024-07-26 04:49:45,069] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411672] [RANK:0] Loading dataset with base_type: sharegpt and prompt_style: None
Tokenizing Prompts (num_proc=40): 100%|████████████████████████████████| 456361/456361 [00:48<00:00, 9346.17 examples/s]
[2024-07-26 04:50:35,966] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411672] [RANK:0] Loading dataset with base_type: sharegpt and prompt_style: None
Tokenizing Prompts (num_proc=40): 100%|██████████████████████████████████| 10000/10000 [00:07<00:00, 1415.65 examples/s]
[2024-07-26 04:50:44,827] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411672] [RANK:0] Loading dataset with base_type: sharegpt and prompt_style: None
Tokenizing Prompts (num_proc=40): 100%|████████████████████████████████████| 6866/6866 [00:06<00:00, 1004.21 examples/s]
[2024-07-26 04:50:53,497] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411672] [RANK:0] Loading dataset with base_type: sharegpt and prompt_style: None
Tokenizing Prompts (num_proc=40): 100%|████████████████████████████████████████| 302/302 [00:06<00:00, 48.89 examples/s]
[2024-07-26 04:51:00,193] [INFO] [axolotl.load_tokenized_prepared_datasets:414] [PID:2411672] [RANK:0] merging datasets
[2024-07-26 04:51:00,243] [DEBUG] [axolotl.load_tokenized_prepared_datasets:419] [PID:2411672] [RANK:0] shuffle merged datasets
Dropping Long Sequences (num_proc=40): 100%|██████████████████████████| 527905/527905 [00:11<00:00, 46759.39 examples/s]
[2024-07-26 04:51:12,997] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:2411673] [RANK:1] Unable to find prepared dataset in last_run_prepared/07aa2af7881c3ca21e89fc99ca102cd6
[2024-07-26 04:51:12,998] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:2411673] [RANK:1] Loading raw datasets...
[2024-07-26 04:51:12,998] [WARNING] [axolotl.load_tokenized_prepared_datasets:186] [PID:2411673] [RANK:1] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-07-26 04:51:12,998] [INFO] [axolotl.load_tokenized_prepared_datasets:193] [PID:2411673] [RANK:1] No seed provided, using default seed of 42
[2024-07-26 04:51:12,998] [INFO] [axolotl.load_tokenized_prepared_datasets:427] [PID:2411672] [RANK:0] Saving merged prepared dataset to disk... last_run_prepared/07aa2af7881c3ca21e89fc99ca102cd6
Saving the dataset (1/4 shards):  27%|████████▋                       | 116763/427052 [00:01<00:03, 92513.00 examples/s][2024-07-26 04:51:14,349] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411673] [RANK:1] Loading dataset with base_type: sharegpt and prompt_style: None
Saving the dataset (2/4 shards):  55%|█████████████████▋              | 235526/427052 [00:02<00:01, 98178.17 examples/s][2024-07-26 04:51:15,570] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411673] [RANK:1] Loading dataset with base_type: sharegpt and prompt_style: None
Saving the dataset (3/4 shards):  86%|███████████████████████████▎    | 365289/427052 [00:03<00:00, 95701.74 examples/s][2024-07-26 04:51:16,939] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411673] [RANK:1] Loading dataset with base_type: sharegpt and prompt_style: None
Saving the dataset (4/4 shards): 100%|████████████████████████████████| 427052/427052 [00:04<00:00, 96245.21 examples/s]
[2024-07-26 04:51:18,113] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411673] [RANK:1] Loading dataset with base_type: sharegpt and prompt_style: None
[2024-07-26 04:51:19,330] [INFO] [axolotl.get_dataset_wrapper:540] [PID:2411673] [RANK:1] Loading dataset with base_type: sharegpt and prompt_style: None
[2024-07-26 04:51:19,525] [INFO] [axolotl.load_tokenized_prepared_datasets:414] [PID:2411673] [RANK:1] merging datasets
[2024-07-26 04:51:19,566] [DEBUG] [axolotl.load_tokenized_prepared_datasets:419] [PID:2411673] [RANK:1] shuffle merged datasets
[2024-07-26 04:51:20,131] [DEBUG] [axolotl.calculate_total_num_steps:297] [PID:2411672] [RANK:0] total_num_tokens: 166_679_416
[2024-07-26 04:51:24,993] [DEBUG] [axolotl.calculate_total_num_steps:310] [PID:2411672] [RANK:0] `total_supervised_tokens: 70_344_664`
[2024-07-26 04:51:24,994] [DEBUG] [axolotl.calculate_total_num_steps:388] [PID:2411672] [RANK:0] total_num_steps: 202850
[2024-07-26 04:51:25,017] [DEBUG] [axolotl.train.train:66] [PID:2411672] [RANK:0] loading tokenizer... /home/owen/models/Meta-Llama-3-70B-Instruct-abliterated-v3.5
[2024-07-26 04:51:25,291] [DEBUG] [axolotl.load_tokenizer:280] [PID:2411672] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-07-26 04:51:25,291] [DEBUG] [axolotl.load_tokenizer:281] [PID:2411672] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-07-26 04:51:25,291] [DEBUG] [axolotl.load_tokenizer:282] [PID:2411672] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-07-26 04:51:25,291] [DEBUG] [axolotl.load_tokenizer:283] [PID:2411672] [RANK:0] UNK: None / None
[2024-07-26 04:51:25,291] [INFO] [axolotl.load_tokenizer:294] [PID:2411672] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 04:51:25,291] [DEBUG] [axolotl.train.train:95] [PID:2411672] [RANK:0] loading model and peft_config...
[2024-07-26 04:51:25,787] [DEBUG] [axolotl.load_tokenizer:280] [PID:2411673] [RANK:1] EOS: 128009 / <|eot_id|>
[2024-07-26 04:51:25,787] [DEBUG] [axolotl.load_tokenizer:281] [PID:2411673] [RANK:1] BOS: 128000 / <|begin_of_text|>
[2024-07-26 04:51:25,787] [DEBUG] [axolotl.load_tokenizer:282] [PID:2411673] [RANK:1] PAD: 128001 / <|end_of_text|>
[2024-07-26 04:51:25,787] [DEBUG] [axolotl.load_tokenizer:283] [PID:2411673] [RANK:1] UNK: None / None
[2024-07-26 04:51:25,787] [INFO] [axolotl.load_tokenizer:294] [PID:2411673] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 30/30 [06:27<00:00, 12.91s/it]
[2024-07-26 04:57:54,251] [INFO] [axolotl.load_model:764] [PID:2411672] [RANK:0] GPU memory usage after model load: 4.929GB (+0.216GB cache, +0.846GB misc)
[2024-07-26 04:57:54,255] [INFO] [axolotl.load_lora:986] [PID:2411672] [RANK:0] found linear modules: ['gate_proj', 'o_proj', 'up_proj', 'down_proj', 'k_proj', 'q_proj', 'v_proj']
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 30/30 [06:27<00:00, 12.91s/it]
[2024-07-26 04:57:54,307] [INFO] [axolotl.load_model:764] [PID:2411673] [RANK:1] GPU memory usage after model load: 4.929GB (+0.216GB cache, +0.846GB misc)
[2024-07-26 04:57:54,313] [INFO] [axolotl.load_lora:986] [PID:2411673] [RANK:1] found linear modules: ['q_proj', 'o_proj', 'v_proj', 'up_proj', 'gate_proj', 'k_proj', 'down_proj']
[2024-07-26 04:57:55,889] [INFO] [axolotl.load_model:869] [PID:2411673] [RANK:1] GPU memory usage after adapters: 4.929GB (+0.216GB cache, +0.846GB misc)
/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
trainable params: 103,546,880 || all params: 70,657,253,376 || trainable%: 0.1465
[2024-07-26 04:57:56,909] [INFO] [axolotl.load_model:869] [PID:2411672] [RANK:0] GPU memory usage after adapters: 3.917GB (+1.228GB cache, +0.846GB misc)
/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
When using FSDP full shard, instead of using `gradient_checkpointing` in TrainingArguments, please use `activation_checkpointing` in `fsdp_config`. The former introduces a redundant AllGather operation in backward pass. Reference: https://github.com/huggingface/transformers/issues/30404
[2024-07-26 04:57:57,591] [INFO] [axolotl.train.train:136] [PID:2411672] [RANK:0] Pre-saving adapter config to ./qlora-llama3-70b
[2024-07-26 04:57:57,709] [INFO] [axolotl.train.train:173] [PID:2411672] [RANK:0] Starting trainer...
/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py:1550: UserWarning: Upcasted low precision parameters in Linear because mixed precision turned on in FSDP. Affects: weight.
  warnings.warn(
/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py:1556: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
  warnings.warn(
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  0%|                                                                                        | 0/202848 [00:00<?, ?it/s]
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/home/owen/axolotl/src/axolotl/cli/train.py", line 72, in <module>
[rank0]:     fire.Fire(do_cli)
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
[rank0]:     return do_train(parsed_cfg, parsed_cli_args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/axolotl/src/axolotl/cli/train.py", line 67, in do_train
[rank0]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/axolotl/src/axolotl/train.py", line 187, in train
[rank0]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2278, in _inner_training_loop
[rank0]:     with self.accelerator.accumulate(model):
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/contextlib.py", line 137, in __enter__
[rank0]:     return next(self.gen)
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 1069, in accumulate
[rank0]:     cm_stack.enter_context(contextlib.nullcontext() if allow_gradient_sync else self.no_sync(m))
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/contextlib.py", line 517, in enter_context
[rank0]:     result = _enter(cm)
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/contextlib.py", line 137, in __enter__
[rank0]:     return next(self.gen)
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 950, in no_sync
[rank0]:     with context():
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/contextlib.py", line 137, in __enter__
[rank0]:     return next(self.gen)
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1048, in no_sync
[rank0]:     _lazy_init(self, self)
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 138, in _lazy_init
[rank0]:     _share_state_and_init_handle_attrs(state, root_module)
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 178, in _share_state_and_init_handle_attrs
[rank0]:     handle.init_flat_param_attributes()
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 1187, in init_flat_param_attributes
[rank0]:     flat_param._local_shard = flat_param._local_shard.pin_memory()
[rank0]:                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: CUDA error: out of memory
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

  0%|                                                                                        | 0/202848 [00:01<?, ?it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank1]:   File "<frozen runpy>", line 88, in _run_code
[rank1]:   File "/home/owen/axolotl/src/axolotl/cli/train.py", line 72, in <module>
[rank1]:     fire.Fire(do_cli)
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
[rank1]:     return do_train(parsed_cfg, parsed_cli_args)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/axolotl/src/axolotl/cli/train.py", line 67, in do_train
[rank1]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/axolotl/src/axolotl/train.py", line 187, in train
[rank1]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2278, in _inner_training_loop
[rank1]:     with self.accelerator.accumulate(model):
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 1069, in accumulate
[rank1]:     cm_stack.enter_context(contextlib.nullcontext() if allow_gradient_sync else self.no_sync(m))
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/contextlib.py", line 517, in enter_context
[rank1]:     result = _enter(cm)
[rank1]:              ^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 950, in no_sync
[rank1]:     with context():
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1048, in no_sync
[rank1]:     _lazy_init(self, self)
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 138, in _lazy_init
[rank1]:     _share_state_and_init_handle_attrs(state, root_module)
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 178, in _share_state_and_init_handle_attrs
[rank1]:     handle.init_flat_param_attributes()
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/fsdp/_flat_param.py", line 1187, in init_flat_param_attributes
[rank1]:     flat_param._local_shard = flat_param._local_shard.pin_memory()
[rank1]:                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: CUDA error: out of memory
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

W0726 04:58:17.074000 140703784469568 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2411673 closing signal SIGTERM
E0726 04:58:17.338000 140703784469568 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 2411672) of binary: /home/owen/miniconda3/envs/axolotl/bin/python
Traceback (most recent call last):
  File "/home/owen/miniconda3/envs/axolotl/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
    multi_gpu_launcher(args)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/owen/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-26_04:58:17
  host      : COMPUTE-PC.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2411672)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

OK, I tried training the 8B on my 2x3090 Ubuntu inference machine and it also OOMs with DeepSpeed at 8192 ctx, so I don't think that part is a WSL issue. DeepSpeed ZeRO-3 just doesn't seem to shard the model at all in axolotl.

HOWEVER, FSDP on Ubuntu works perfectly: I can load an 8B LoRA at 8192 ctx. I haven't tried 70B training there because that machine only has 64GB of RAM. I guess FSDP is just broken under WSL?
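The 70B FSDP traceback above fails inside `flat_param._local_shard.pin_memory()`, i.e. while page-locking the CPU-offloaded parameter shard, not while allocating VRAM. A quick way to test whether a WSL pinned-memory limit is the culprit is to pin a large CPU tensor in isolation. This is only a sketch; the ~10 GiB size is an illustrative guess at the per-rank offloaded shard, not a measured value, and it assumes that much free system RAM.

```python
# Hypothetical repro of only the failing step: page-locking a large CPU buffer.
# If this raises "CUDA error: out of memory" under WSL2 while native Ubuntu is fine,
# the limit being hit is pinned host memory, not GPU VRAM.
import torch

shard = torch.empty(10 * 1024**3, dtype=torch.uint8)  # ~10 GiB CPU tensor (illustrative size)
shard = shard.pin_memory()  # the same call that fails in the FSDP traceback above
print("pinned:", shard.is_pinned())
```

If that pins fine, the issue is more likely the total offload size per rank; turning off parameter offload in the FSDP config (when the sharded model otherwise fits) avoids the pinning path entirely.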

2x24GB 8B LORA 8192 Deepspeed nvtop:

(screenshot: nvtop output)

2x24GB 8B LORA 8192 Deepspeed traceback:

accelerate launch -m axolotl.cli.train lora-sft.yml 
The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_processes` was set to a value of `2`
                More than one GPU was found, enabling multi-GPU training.
                If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2024-07-26 05:21:51,570] [INFO] [datasets.<module>:58] [PID:9298] PyTorch version 2.3.1 available.
[2024-07-26 05:21:51,570] [INFO] [datasets.<module>:58] [PID:9297] PyTorch version 2.3.1 available.
[2024-07-26 05:21:52,428] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-26 05:21:52,429] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-26 05:21:52,478] [INFO] [root.spawn:38] [PID:9298] gcc -pthread -B /home/arli/miniconda3/envs/axolotl/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/arli/miniconda3/envs/axolotl/include -fPIC -O2 -isystem /home/arli/miniconda3/envs/axolotl/include -fPIC -c /tmp/tmp1bunh70o/test.c -o /tmp/tmp1bunh70o/test.o
[2024-07-26 05:21:52,479] [INFO] [root.spawn:38] [PID:9297] gcc -pthread -B /home/arli/miniconda3/envs/axolotl/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/arli/miniconda3/envs/axolotl/include -fPIC -O2 -isystem /home/arli/miniconda3/envs/axolotl/include -fPIC -c /tmp/tmpi_psjqf8/test.c -o /tmp/tmpi_psjqf8/test.o
[2024-07-26 05:21:52,494] [INFO] [root.spawn:38] [PID:9297] gcc -pthread -B /home/arli/miniconda3/envs/axolotl/compiler_compat /tmp/tmpi_psjqf8/test.o -laio -o /tmp/tmpi_psjqf8/a.out
[2024-07-26 05:21:52,495] [INFO] [root.spawn:38] [PID:9298] gcc -pthread -B /home/arli/miniconda3/envs/axolotl/compiler_compat /tmp/tmp1bunh70o/test.o -laio -o /tmp/tmp1bunh70o/a.out
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-26 05:21:53,693] [INFO] [axolotl.utils.config.models.input.check_eval_packing:961] [PID:9298] [RANK:1] setting `remove_unused_columns: false` for when sample_packing and eval_sample_packing don't match
[2024-07-26 05:21:53,694] [WARNING] [axolotl.utils.config.models.input.hint_lora_8bit:1047] [PID:9298] [RANK:1] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-07-26 05:21:53,694] [DEBUG] [axolotl.normalize_config:80] [PID:9298] [RANK:1] bf16 support detected, enabling for this configuration.
[2024-07-26 05:21:53,696] [INFO] [axolotl.normalize_config:183] [PID:9298] [RANK:1] GPU memory usage baseline: 0.000GB (+0.329GB misc)
[2024-07-26 05:21:53,698] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-26 05:21:53,702] [INFO] [axolotl.utils.config.models.input.check_eval_packing:961] [PID:9297] [RANK:0] setting `remove_unused_columns: false` for when sample_packing and eval_sample_packing don't match
[2024-07-26 05:21:53,703] [WARNING] [axolotl.utils.config.models.input.hint_lora_8bit:1047] [PID:9297] [RANK:0] We recommend setting `load_in_8bit: true` for LORA finetuning
[2024-07-26 05:21:53,703] [DEBUG] [axolotl.normalize_config:80] [PID:9297] [RANK:0] bf16 support detected, enabling for this configuration.
[2024-07-26 05:21:53,734] [INFO] [axolotl.normalize_config:183] [PID:9297] [RANK:0] GPU memory usage baseline: 0.000GB (+0.470GB misc)
[2024-07-26 05:21:53,736] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-26 05:21:53,736] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
                                 dP            dP   dP 
                                 88            88   88 
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88 
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88 
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88 
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP 
                                                       
                                                       

****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.32.0         
        peft: 0.11.1         
transformers: 4.43.1         
         trl: 0.9.6          
       torch: 2.3.1          
bitsandbytes: 0.43.1         
****************************************
[2024-07-26 05:21:53,794] [WARNING] [axolotl.scripts.check_user_token:487] [PID:9297] [RANK:0] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2024-07-26 05:21:53,831] [WARNING] [axolotl.scripts.check_user_token:487] [PID:9298] [RANK:1] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2024-07-26 05:21:54,061] [DEBUG] [axolotl.load_tokenizer:280] [PID:9297] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-07-26 05:21:54,061] [DEBUG] [axolotl.load_tokenizer:281] [PID:9297] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-07-26 05:21:54,061] [DEBUG] [axolotl.load_tokenizer:282] [PID:9297] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-07-26 05:21:54,061] [DEBUG] [axolotl.load_tokenizer:283] [PID:9297] [RANK:0] UNK: None / None
[2024-07-26 05:21:54,061] [INFO] [axolotl.load_tokenizer:294] [PID:9297] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 05:21:54,062] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:9297] [RANK:0] Unable to find prepared dataset in lora_last_run_prepared/68245a345f02c2d741f9742cfff757d0
[2024-07-26 05:21:54,062] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:9297] [RANK:0] Loading raw datasets...
[2024-07-26 05:21:54,062] [WARNING] [axolotl.load_tokenized_prepared_datasets:186] [PID:9297] [RANK:0] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-07-26 05:21:54,062] [INFO] [axolotl.load_tokenized_prepared_datasets:193] [PID:9297] [RANK:0] No seed provided, using default seed of 42
[2024-07-26 05:21:54,090] [DEBUG] [axolotl.load_tokenizer:280] [PID:9298] [RANK:1] EOS: 128009 / <|eot_id|>
[2024-07-26 05:21:54,090] [DEBUG] [axolotl.load_tokenizer:281] [PID:9298] [RANK:1] BOS: 128000 / <|begin_of_text|>
[2024-07-26 05:21:54,090] [DEBUG] [axolotl.load_tokenizer:282] [PID:9298] [RANK:1] PAD: 128001 / <|end_of_text|>
[2024-07-26 05:21:54,090] [DEBUG] [axolotl.load_tokenizer:283] [PID:9298] [RANK:1] UNK: None / None
[2024-07-26 05:21:54,090] [INFO] [axolotl.load_tokenizer:294] [PID:9298] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 05:21:55,304] [INFO] [axolotl.get_dataset_wrapper:540] [PID:9297] [RANK:0] Loading dataset with base_type: sharegpt and prompt_style: None
Tokenizing Prompts (num_proc=16): 100%|██████████████████████████████████████████| 1084/1084 [00:02<00:00, 407.97 examples/s]
[2024-07-26 05:21:58,354] [INFO] [axolotl.load_tokenized_prepared_datasets:414] [PID:9297] [RANK:0] merging datasets
Dropping Long Sequences (num_proc=16): 100%|████████████████████████████████████| 1084/1084 [00:00<00:00, 7063.84 examples/s]
Add position_id column (Sample Packing) (num_proc=16): 100%|████████████████████| 1084/1084 [00:00<00:00, 6151.94 examples/s]
[2024-07-26 05:21:59,288] [INFO] [axolotl.load_tokenized_prepared_datasets:427] [PID:9297] [RANK:0] Saving merged prepared dataset to disk... lora_last_run_prepared/68245a345f02c2d741f9742cfff757d0
[2024-07-26 05:21:59,289] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:9298] [RANK:1] Unable to find prepared dataset in lora_last_run_prepared/68245a345f02c2d741f9742cfff757d0
[2024-07-26 05:21:59,289] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:9298] [RANK:1] Loading raw datasets...
[2024-07-26 05:21:59,289] [WARNING] [axolotl.load_tokenized_prepared_datasets:186] [PID:9298] [RANK:1] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-07-26 05:21:59,289] [INFO] [axolotl.load_tokenized_prepared_datasets:193] [PID:9298] [RANK:1] No seed provided, using default seed of 42
Saving the dataset (1/1 shards): 100%|█████████████████████████████████████████| 1084/1084 [00:00<00:00, 78091.19 examples/s]
[2024-07-26 05:22:00,299] [INFO] [axolotl.get_dataset_wrapper:540] [PID:9298] [RANK:1] Loading dataset with base_type: sharegpt and prompt_style: None
[2024-07-26 05:22:00,476] [INFO] [axolotl.load_tokenized_prepared_datasets:414] [PID:9298] [RANK:1] merging datasets
[2024-07-26 05:22:00,481] [DEBUG] [axolotl.calculate_total_num_steps:297] [PID:9297] [RANK:0] total_num_tokens: 297_543
[2024-07-26 05:22:00,489] [DEBUG] [axolotl.calculate_total_num_steps:310] [PID:9297] [RANK:0] `total_supervised_tokens: 13_934`
[2024-07-26 05:22:04,552] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9297] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 148771
[2024-07-26 05:22:04,552] [DEBUG] [axolotl.calculate_total_num_steps:362] [PID:9297] [RANK:0] data_loader_len: 0
[2024-07-26 05:22:04,624] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9298] [RANK:1] packing_efficiency_estimate: 1.0 total_num_tokens per device: 148771
[2024-07-26 05:22:04,655] [INFO] [axolotl.calc_sample_packing_eff_est:368] [PID:9297] [RANK:0] sample_packing_eff_est across ranks: [0.9558202028274536, 0.9816531538963318]
[2024-07-26 05:22:04,656] [DEBUG] [axolotl.calculate_total_num_steps:380] [PID:9297] [RANK:0] sample_packing_eff_est: 0.99
[2024-07-26 05:22:04,656] [DEBUG] [axolotl.calculate_total_num_steps:388] [PID:9297] [RANK:0] total_num_steps: 0
[2024-07-26 05:22:04,682] [DEBUG] [axolotl.train.train:66] [PID:9297] [RANK:0] loading tokenizer... /home/arli/models/Meta-Llama-3.1-8B-Instruct
[2024-07-26 05:22:04,944] [DEBUG] [axolotl.load_tokenizer:280] [PID:9298] [RANK:1] EOS: 128009 / <|eot_id|>
[2024-07-26 05:22:04,944] [DEBUG] [axolotl.load_tokenizer:281] [PID:9298] [RANK:1] BOS: 128000 / <|begin_of_text|>
[2024-07-26 05:22:04,944] [DEBUG] [axolotl.load_tokenizer:282] [PID:9298] [RANK:1] PAD: 128001 / <|end_of_text|>
[2024-07-26 05:22:04,944] [DEBUG] [axolotl.load_tokenizer:283] [PID:9298] [RANK:1] UNK: None / None
[2024-07-26 05:22:04,944] [INFO] [axolotl.load_tokenizer:294] [PID:9298] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 05:22:04,964] [DEBUG] [axolotl.load_tokenizer:280] [PID:9297] [RANK:0] EOS: 128009 / <|eot_id|>
[2024-07-26 05:22:04,964] [DEBUG] [axolotl.load_tokenizer:281] [PID:9297] [RANK:0] BOS: 128000 / <|begin_of_text|>
[2024-07-26 05:22:04,964] [DEBUG] [axolotl.load_tokenizer:282] [PID:9297] [RANK:0] PAD: 128001 / <|end_of_text|>
[2024-07-26 05:22:04,965] [DEBUG] [axolotl.load_tokenizer:283] [PID:9297] [RANK:0] UNK: None / None
[2024-07-26 05:22:04,965] [INFO] [axolotl.load_tokenizer:294] [PID:9297] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-07-26 05:22:04,965] [DEBUG] [axolotl.train.train:95] [PID:9297] [RANK:0] loading model and peft_config...
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-07-26 05:22:05,767] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 291, num_elems = 8.03B
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.82it/s]
[2024-07-26 05:22:08,020] [INFO] [axolotl.load_model:764] [PID:9298] [RANK:1] GPU memory usage after model load: 7.481GB (+14.582GB cache, +0.834GB misc)
[2024-07-26 05:22:08,023] [INFO] [axolotl.load_model:824] [PID:9298] [RANK:1] converting modules to torch.bfloat16 for flash attention
[2024-07-26 05:22:08,025] [INFO] [axolotl.load_lora:986] [PID:9298] [RANK:1] found linear modules: ['o_proj', 'v_proj', 'gate_proj', 'down_proj', 'q_proj', 'k_proj', 'up_proj']
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████| 4/4 [00:06<00:00,  1.56s/it]
[2024-07-26 05:22:12,070] [INFO] [axolotl.load_model:764] [PID:9297] [RANK:0] GPU memory usage after model load: 7.481GB (+2.125GB cache, +1.006GB misc)
[2024-07-26 05:22:12,073] [INFO] [axolotl.load_model:824] [PID:9297] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-07-26 05:22:12,075] [INFO] [axolotl.load_lora:986] [PID:9297] [RANK:0] found linear modules: ['gate_proj', 'o_proj', 'k_proj', 'v_proj', 'q_proj', 'down_proj', 'up_proj']
[2024-07-26 05:22:13,304] [INFO] [axolotl.load_model:869] [PID:9298] [RANK:1] GPU memory usage after adapters: 7.795GB (+14.416GB cache, +0.834GB misc)
trainable params: 167,772,160 || all params: 8,198,033,408 || trainable%: 2.0465
[2024-07-26 05:22:13,509] [INFO] [axolotl.load_model:869] [PID:9297] [RANK:0] GPU memory usage after adapters: 7.795GB (+1.958GB cache, +1.006GB misc)
/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/training_args.py:1525: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-07-26 05:22:14,066] [INFO] [axolotl.train.train:136] [PID:9297] [RANK:0] Pre-saving adapter config to ./lora_out
[2024-07-26 05:22:14,165] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9298] [RANK:1] packing_efficiency_estimate: 0.99 total_num_tokens per device: 148771
[2024-07-26 05:22:14,165] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9298] [RANK:1] packing_efficiency_estimate: 0.99 total_num_tokens per device: 148771
[2024-07-26 05:22:14,166] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9298] [RANK:1] packing_efficiency_estimate: 0.99 total_num_tokens per device: 148771
[2024-07-26 05:22:14,167] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9298] [RANK:1] packing_efficiency_estimate: 0.99 total_num_tokens per device: 148771
[2024-07-26 05:22:14,192] [INFO] [axolotl.train.train:173] [PID:9297] [RANK:0] Starting trainer...
[2024-07-26 05:22:14,363] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9297] [RANK:0] packing_efficiency_estimate: 0.99 total_num_tokens per device: 148771
[2024-07-26 05:22:14,363] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9297] [RANK:0] packing_efficiency_estimate: 0.99 total_num_tokens per device: 148771
[2024-07-26 05:22:14,364] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9297] [RANK:0] packing_efficiency_estimate: 0.99 total_num_tokens per device: 148771
[2024-07-26 05:22:14,365] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9297] [RANK:0] packing_efficiency_estimate: 0.99 total_num_tokens per device: 148771
[2024-07-26 05:22:14,450] [WARNING] [engine.py:1188:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2024-07-26 05:22:16,762] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9298] [RANK:1] packing_efficiency_estimate: 0.99 total_num_tokens per device: 148771
[2024-07-26 05:22:16,762] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9298] [RANK:1] packing_efficiency_estimate: 0.99 total_num_tokens per device: 148771
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[2024-07-26 05:22:17,592] [INFO] [wandb.__setitem__:151] [PID:9297] config set model/num_parameters = 0 - None
[2024-07-26 05:22:17,619] [INFO] [axolotl.callbacks.on_train_begin:785] [PID:9297] [RANK:0] The Axolotl config has been saved to the WandB run under files.
  0%|                                                                                                  | 0/1 [00:00<?, ?it/s]
[2024-07-26 05:22:17,620] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9297] [RANK:0] packing_efficiency_estimate: 0.99 total_num_tokens per device: 148771
[2024-07-26 05:22:17,621] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:9297] [RANK:0] packing_efficiency_estimate: 0.99 total_num_tokens per device: 148771
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[rank1]: Traceback (most recent call last):
[rank1]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank1]:   File "<frozen runpy>", line 88, in _run_code
[rank1]:   File "/home/arli/axolotl/src/axolotl/cli/train.py", line 72, in <module>
[rank1]:     fire.Fire(do_cli)
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/arli/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
[rank1]:     return do_train(parsed_cfg, parsed_cli_args)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/arli/axolotl/src/axolotl/cli/train.py", line 67, in do_train
[rank1]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/arli/axolotl/src/axolotl/train.py", line 187, in train
[rank1]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 3349, in training_step
[rank1]:     self.accelerator.backward(loss, **kwargs)
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2143, in backward
[rank1]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
[rank1]:     self.engine.backward(loss, **kwargs)
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
[rank1]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
[rank1]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank1]:     scaled_loss.backward(retain_graph=retain_graph)
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/_tensor.py", line 525, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU  has a total capacity of 23.68 GiB of which 3.38 GiB is free. Including non-PyTorch memory, this process has 20.29 GiB memory in use. Of the allocated memory 19.16 GiB is allocated by PyTorch, and 544.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/arli/axolotl/src/axolotl/cli/train.py", line 72, in <module>
    fire.Fire(do_cli)
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arli/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arli/axolotl/src/axolotl/cli/train.py", line 67, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arli/axolotl/src/axolotl/train.py", line 187, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 3349, in training_step    self.accelerator.backward(loss, **kwargs)
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2143, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 166, in backward    self.engine.backward(loss, **kwargs)
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/home/arli/axolotl/src/axolotl/cli/train.py", line 72, in <module>
[rank0]:     fire.Fire(do_cli)
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/arli/axolotl/src/axolotl/cli/train.py", line 39, in do_cli
[rank0]:     return do_train(parsed_cfg, parsed_cli_args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/arli/axolotl/src/axolotl/cli/train.py", line 67, in do_train
[rank0]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/arli/axolotl/src/axolotl/train.py", line 187, in train
[rank0]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 3349, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2143, in backward
[rank0]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
[rank0]:     self.engine.backward(loss, **kwargs)
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
[rank0]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
[rank0]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank0]:     scaled_loss.backward(retain_graph=retain_graph)
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/_tensor.py", line 525, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 
W0726 05:22:29.312000 131213135434816 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 9297 closing signal SIGTERM
E0726 05:22:29.476000 131213135434816 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 9298) of binary: /home/arli/miniconda3/envs/axolotl/bin/python
Traceback (most recent call last):
  File "/home/arli/miniconda3/envs/axolotl/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1088, in launch_command
    multi_gpu_launcher(args)
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arli/miniconda3/envs/axolotl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-26_05:22:29
  host      : arli-infer-0
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 9298)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@Nero10578 I can't find the issue now but I noted this somewhere else last year: deepspeed does not work with quants (e.g. load_in_8bit and load_in_4bit), or at least I've never been able to get it to work...no idea why

Yeah, that definitely seems to be the case. So DeepSpeed is only useful for full-model (unquantized) training.
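For what it's worth, a commonly cited reason ZeRO-3 and bitsandbytes quantization don't compose is that ZeRO-3 partitions and gathers parameters by their logical shape, while a 4-bit layer stores its weight as a packed uint8 buffer plus separate quantization state that the partitioner has no handling for. Below is a minimal illustration of that layout only (a sketch assuming bitsandbytes and a CUDA GPU are available; it does not involve DeepSpeed itself).

```python
# Sketch: inspect the packed 4-bit weight layout that ZeRO-3's parameter
# partitioning does not know how to gather/scatter. Not a DeepSpeed repro.
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear4bit(4096, 4096, compute_dtype=torch.bfloat16)
layer = layer.to("cuda")  # quantization happens when the layer is moved to the GPU

w = layer.weight
print(type(w).__name__)            # Params4bit, not a plain nn.Parameter
print(w.dtype, tuple(w.shape))     # torch.uint8 with a packed shape, not (4096, 4096)
print(w.quant_state is not None)   # extra quantization state stored outside the tensor data
```

FSDP's QLoRA support (bitsandbytes 0.43+ together with recent transformers/accelerate/PEFT releases) added handling for exactly this case, which is presumably why the FSDP path is the one that can work here.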