modal-labs/llm-finetuning

URGENT: Unable to Train using Modal

Closed this issue · 1 comment

I have logged into Hugging Face (as you can see below) and changed the line `modal.Secret.from_name("huggingface")` to `modal.Secret.from_name("modal-huggingface-testing")` in `src/common.py`. I also have access to Llama-3 on Hugging Face.
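
For context, the secret is attached to the Modal functions in `src/common.py` roughly like this (a minimal sketch; the app name and function body are illustrative, not the repo's actual code):

```python
import modal

# The named secret must already exist in your Modal workspace; its key/value
# pairs are injected into the container as environment variables.
hf_secret = modal.Secret.from_name("modal-huggingface-testing")

app = modal.App("example-axolotl")  # app name assumed for illustration

@app.function(secrets=[hf_secret])
def train():
    # huggingface_hub reads the token (e.g. HF_TOKEN) from the environment
    # to authenticate calls like HfApi().whoami().
    ...
```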

This is the error I get when trying to train the model using the codebase in this repo. It reports an invalid username/password even though I have clearly already logged into Hugging Face on the CLI (see the `huggingface-cli whoami` output below).

I'm fine-tuning LLMs for a project with an urgent deadline. Kindly reply/resolve this ASAP.


```
    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (osxkeychain).
Your token has been saved to /Users/vprateek/.cache/huggingface/token
Login successful
(py3.11) vprateek@Prateeks-MacBook-Pro llm-finetuning-main % huggingface-cli whoami
vprateek
(py3.11) vprateek@Prateeks-MacBook-Pro llm-finetuning-main % modal run --detach src.train --config=config/llama-3.yml --data=data/sqlqa.subsample.jsonl
Note that running a local entrypoint in detached mode only keeps the last triggered Modal function alive after the parent process has been killed or disconnected.
✓ Initialized. View run at https://modal.com/prateek/main/apps/ap-9PGC1Y0MgjIoY2x1FKrQ61
✓ Created objects.
├── 🔨 Created mount PythonPackage:src.train
├── 🔨 Created mount PythonPackage:src.inference
├── 🔨 Created mount PythonPackage:src
├── 🔨 Created function Inference.*.
├── 🔨 Created function train.
├── 🔨 Created function preproc_data.
├── 🔨 Created function merge.
├── 🔨 Created function launch.
├── 🔨 Created function Inference.completion.
├── 🔨 Created function Inference.non_streaming.
└── 🔨 Created web function Inference.web => https://pateek--example-axolotl-inference-web-dev.modal.run
Volume contains NousResearch/Meta-Llama-3-8B.
Preparing training run in /runs/axo-2024-07-15-21-54-24-20e0.
Spawning container for data preprocessing.
Preprocessing data.
WARNING: BNB_CUDA_VERSION=121 environment variable detected; loading libbitsandbytes_cuda121.so.
This can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64

[2024-07-15 21:54:39,294] [INFO] [datasets.<module>:58] [PID:4] PyTorch version 2.3.0+cu121 available.
[2024-07-15 21:54:41,257] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
[2024-07-15 21:54:41,454] [INFO] [root.spawn:38] [PID:4] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -fPIC -O2 -isystem /root/miniconda3/envs/py3.11/include -g0 -fPIC -g0 -c /tmp/tmpj8_f3kta/test.c -o /tmp/tmpj8_f3kta/test.o
[2024-07-15 21:54:41,514] [INFO] [root.spawn:38] [PID:4] gcc -pthread -B /root/miniconda3/envs/py3.11/compiler_compat /tmp/tmpj8_f3kta/test.o -laio -o /tmp/tmpj8_f3kta/a.out
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
                                 dP            dP   dP 
                                 88            88   88 
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88 
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88 
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88 
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP 
                                                       
                                                       

****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.30.1         
        peft: 0.11.1         
transformers: 4.42.3         
         trl: 0.8.7.dev0     
       torch: 2.3.0+cu121    
bitsandbytes: 0.43.1         
****************************************
[2024-07-15 21:54:44,657] [DEBUG] [axolotl.normalize_config:80] [PID:4] [RANK:0] bf16 support detected, enabling for this configuration.
[2024-07-15 21:54:44,888] [INFO] [axolotl.normalize_config:183] [PID:4] [RANK:0] GPU memory usage baseline: 0.000GB (+0.307GB misc)
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/whoami-v2

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 1397, in whoami
    hf_raise_for_status(r)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 371, in hf_raise_for_status
    raise HfHubHTTPError(str(e), response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/whoami-v2 (Request ID: Root=1-66959aa4-25141c8c4d4de1d64754f466;dc7bd8b2-5b73-4e49-ad22-6ccfa0fb67b7)

Invalid username or password.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/axolotl/src/axolotl/cli/preprocess.py", line 91, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/axolotl/src/axolotl/cli/preprocess.py", line 39, in do_cli
    check_user_token()
  File "/workspace/axolotl/src/axolotl/cli/__init__.py", line 484, in check_user_token
    user_info = api.whoami()
                ^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/huggingface_hub/hf_api.py", line 1399, in whoami
    raise HTTPError(
requests.exceptions.HTTPError: Invalid user token. If you didn't pass a user token, make sure you are properly logged in by executing `huggingface-cli login`, and if you did pass a user token, double-check it's correct.
Traceback (most recent call last):
  File "/pkg/modal/_container_io_manager.py", line 503, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 383, in run_input_sync
    res = finalized_function.callable(*local_input.args, **local_input.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/src/train.py", line 55, in preproc_data
    run_cmd(
  File "/root/src/train.py", line 186, in run_cmd
    exit(exit_code)
  File "<frozen _sitebuiltins>", line 26, in __call__
SystemExit: 1
Traceback (most recent call last):
  File "/pkg/modal/_container_io_manager.py", line 503, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 383, in run_input_sync
    res = finalized_function.callable(*local_input.args, **local_input.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/src/train.py", line 131, in launch
    preproc_handle.get()
  File "/pkg/synchronicity/synchronizer.py", line 531, in proxy_method
    return wrapped_method(instance, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pkg/synchronicity/combined_types.py", line 28, in __call__
    raise uc_exc.exc from None
  File "<ta-01J2W67HKMPYXZ2NCR3X9E61N4>:/root/src/train.py", line 55, in preproc_data
  File "<ta-01J2W67HKMPYXZ2NCR3X9E61N4>:/root/src/train.py", line 186, in run_cmd
  File "<frozen _sitebuiltins>", line 26, in __call__
SystemExit: 1
```

I was missing the HF key on Modal. Perhaps consider reordering the instructions in the documentation to first retrieve the Hugging Face key and then create a secret on Modal.
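
For anyone who hits the same 401: the fix is to create the secret that `src/common.py` references in your Modal workspace, with your Hugging Face token in it. A minimal sketch (I'm assuming the token is expected under the `HF_TOKEN` key; check the repo's README for the exact key name):

```bash
# Create a named Modal secret holding the Hugging Face token. The name must
# match the one passed to modal.Secret.from_name() in src/common.py.
modal secret create modal-huggingface-testing HF_TOKEN=hf_xxxxxxxx  # placeholder value
```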