hollowstrawberry/kohya-colab

CUDA not working

guy907223982 opened this issue · 25 comments

CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 122
CUDA SETUP: TODO: compile library for specific version: libbitsandbytes_cuda122.so
CUDA SETUP: Defaulting to libbitsandbytes.so...
CUDA SETUP: CUDA detection failed. Either CUDA driver not installed, CUDA not installed, or you have multiple conflicting CUDA libraries!
CUDA SETUP: If you compiled from source, try again with make CUDA_VERSION=DETECTED_CUDA_VERSION for example, make CUDA_VERSION=113.
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /content/kohya-trainer/train_network.py:873 in │
│ │
│ 870 │ args = parser.parse_args() │
│ 871 │ args = train_util.read_config_from_file(args, parser) │
│ 872 │ │
│ ❱ 873 │ train(args) │
│ 874 │
│ │
│ /content/kohya-trainer/train_network.py:262 in train │
│ │
│ 259 │ │ ) │
│ 260 │ │ trainable_params = network.prepare_optimizer_params(args.text_encoder_lr, args.u │
│ 261 │ │
│ ❱ 262 │ optimizer_name, optimizer_args, optimizer = train_util.get_optimizer(args, trainable │
│ 263 │ │
│ 264 │ # dataloaderを準備する │
│ 265 │ # DataLoaderのプロセス数:0はメインプロセスになる │
│ │
│ /content/kohya-trainer/library/train_util.py:2700 in get_optimizer │
│ │
│ 2697 │ │
│ 2698 │ if optimizer_type == "AdamW8bit".lower(): │
│ 2699 │ │ try: │
│ ❱ 2700 │ │ │ import bitsandbytes as bnb │
│ 2701 │ │ except ImportError: │
│ 2702 │ │ │ raise ImportError("No bitsand bytes / bitsandbytesがインストールされていない │
│ 2703 │ │ print(f"use 8-bit AdamW optimizer | {optimizer_kwargs}") │
│ │
│ /usr/local/lib/python3.10/dist-packages/bitsandbytes/__init__.py:6 in <module> │
│ │
│ 3 # This source code is licensed under the MIT license found in the │
│ 4 # LICENSE file in the root directory of this source tree. │
│ 5 │
│ ❱ 6 from .autograd._functions import ( │
│ 7 │ MatmulLtState, │
│ 8 │ bmm_cublas, │
│ 9 │ matmul, │
│ │
│ /usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py:5 in │
│ │
│ 2 import warnings │
│ 3 │
│ 4 import torch │
│ ❱ 5 import bitsandbytes.functional as F │
│ 6 │
│ 7 from dataclasses import dataclass │
│ 8 from functools import reduce # Required in Python 3 │
│ │
│ /usr/local/lib/python3.10/dist-packages/bitsandbytes/functional.py:13 in │
│ │
│ 10 from typing import Tuple │
│ 11 from torch import Tensor │
│ 12 │
│ ❱ 13 from .cextension import COMPILED_WITH_CUDA, lib │
│ 14 from functools import reduce # Required in Python 3 │
│ 15 │
│ 16 # math.prod not compatible with python < 3.8 │
│ │
│ /usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:41 in │
│ │
│ 38 │ │ return cls._instance │
│ 39 │
│ 40 │
│ ❱ 41 lib = CUDALibrary_Singleton.get_instance().lib │
│ 42 try: │
│ 43 │ lib.cadam32bit_g32 │
│ 44 │ lib.get_context.restype = ct.c_void_p │
│ │
│ /usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:37 in get_instance │
│ │
│ 34 │ def get_instance(cls): │
│ 35 │ │ if cls._instance is None: │
│ 36 │ │ │ cls._instance = cls.__new__(cls) │
│ ❱ 37 │ │ │ cls._instance.initialize() │
│ 38 │ │ return cls._instance │
│ 39 │
│ 40 │
│ │
│ /usr/local/lib/python3.10/dist-packages/bitsandbytes/cextension.py:27 in initialize │
│ │
│ 24 │ │ │ if not binary_path.exists(): │
│ 25 │ │ │ │ print('CUDA SETUP: CUDA detection failed. Either CUDA driver not install │
│ 26 │ │ │ │ print('CUDA SETUP: If you compiled from source, try again with `make CUD │
│ ❱ 27 │ │ │ │ raise Exception('CUDA SETUP: Setup Failed!') │
│ 28 │ │ │ self.lib = ct.cdll.LoadLibrary(binary_path) │
│ 29 │ │ else: │
│ 30 │ │ │ print(f"CUDA SETUP: Loading binary {binary_path}...") │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Exception: CUDA SETUP: Setup Failed!
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /usr/local/bin/accelerate:8 in │
│ │
│ 5 from accelerate.commands.accelerate_cli import main │
│ 6 if __name__ == '__main__': │
│ 7 │ sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0]) │
│ ❱ 8 │ sys.exit(main()) │
│ 9 │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py:45 in main │
│ │
│ 42 │ │ exit(1) │
│ 43 │ │
│ 44 │ # Run │
│ ❱ 45 │ args.func(args) │
│ 46 │
│ 47 │
│ 48 if __name__ == "__main__": │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py:1104 in launch_command │
│ │
│ 1101 │ elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA │
│ 1102 │ │ sagemaker_launcher(defaults, args) │
│ 1103 │ else: │
│ ❱ 1104 │ │ simple_launcher(args) │
│ 1105 │
│ 1106 │
│ 1107 def main(): │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py:567 in simple_launcher │
│ │
│ 564 │ process = subprocess.Popen(cmd, env=current_env) │
│ 565 │ process.wait() │
│ 566 │ if process.returncode != 0: │
│ ❱ 567 │ │ raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) │
│ 568 │
│ 569 │
│ 570 def multi_gpu_launcher(args): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['/usr/bin/python3', 'train_network.py',
'--dataset_config=/content/drive/MyDrive/Loras/ACB/dataset_config.toml',
'--config_file=/content/drive/MyDrive/Loras/ACB/training_config.toml']' returned non-zero
exit status 1.
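The failure path in the log above is bitsandbytes detecting CUDA 12.2, looking for a version-specific binary named libbitsandbytes_cuda122.so, not finding one (the installed 0.35.0 release predates CUDA 12), defaulting to the generic libbitsandbytes.so, and raising when that is missing too. A minimal sketch of that lookup, with illustrative names rather than the library's actual code:

```python
from pathlib import Path

def pick_binary(pkg_dir: Path, cuda_version: str) -> str:
    """Illustrative sketch of the binary lookup, not bitsandbytes' real code."""
    # Prefer the build compiled for the detected CUDA version, e.g. "122".
    candidate = pkg_dir / f"libbitsandbytes_cuda{cuda_version}.so"
    if candidate.exists():
        return candidate.name
    # Otherwise default to the generic library, as the log shows.
    fallback = pkg_dir / "libbitsandbytes.so"
    if fallback.exists():
        return fallback.name
    raise RuntimeError("CUDA SETUP: Setup Failed!")
```

Upgrading bitsandbytes to a release that actually ships a CUDA 12.2 binary makes the first lookup succeed, which is why the fix further down this thread works.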

Perhaps you ran out of GPU time for the week?

> Perhaps you ran out of GPU time for the week?

I didn't think about that

Same issue here. Definitely not a GPU time issue. Haven't used any in over a month.

> Same issue here. Definitely not a GPU time issue. Haven't used any in over a month.

When did this start happening?

I've only just encountered it, but then I haven't used the notebook in weeks.

I can confirm this happens every time starting today. Seems Colab updated their libraries again. Every time they do this it becomes trickier...

I'll take a look

thank you!

Same issue here today. Used yesterday with no issues.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: Detected CUDA version 122
CUDA SETUP: TODO: compile library for specific version: libbitsandbytes_cuda122.so
CUDA SETUP: Defaulting to libbitsandbytes.so...
CUDA SETUP: CUDA detection failed. Either CUDA driver not installed, CUDA not installed, or you have multiple conflicting CUDA libraries!
[...]
Exception: CUDA SETUP: Setup Failed!
[...]
CalledProcessError: Command '['/usr/bin/python3', 'train_network.py',
'--dataset_config=/content/drive/MyDrive/Loras/67Impala/dataset_config.toml',
'--config_file=/content/drive/MyDrive/Loras/67Impala/training_config.toml']' returned non-zero exit status 1.

Just as a data point, this was working five hours ago. Best of luck fixing this.

> I can confirm this happens every time starting today. Seems Colab updated their libraries again. Every time they do this it becomes trickier...

> I'll take a look

Thanks for putting efforts on this.

It comes down to this:

CUDA backend failed to initialize: Found CUDA version 12010, but JAX was built against version 12020, which is newer. The copy of CUDA that is installed must be at least as new as the version against which JAX was built. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

I can't find a way to update CUDA or downgrade JAX properly.

If someone could help, we would all be thankful.
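A side note on reading these logs: the two tools encode CUDA versions differently. JAX prints an integer like 12010 (major × 1000 + minor × 10, so 12010 is CUDA 12.1 and 12020 is 12.2), while bitsandbytes prints 122 (major × 10 + minor, also 12.2). Small decoding helpers (the function names are mine, for illustration):

```python
def decode_jax_cuda(v: int) -> str:
    """JAX-style version integer: 12010 -> '12.1', 12020 -> '12.2'."""
    return f"{v // 1000}.{(v % 1000) // 10}"

def decode_bnb_cuda(v: int) -> str:
    """bitsandbytes-style version integer: 122 -> '12.2'."""
    return f"{v // 10}.{v % 10}"
```

Decoded, the warning above says the installed runtime is CUDA 12.1 but JAX was built against 12.2, hence the "must be at least as new" complaint.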

> It comes down to this:
>
> CUDA backend failed to initialize: Found CUDA version 12010, but JAX was built against version 12020, which is newer. The copy of CUDA that is installed must be at least as new as the version against which JAX was built. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
>
> I can't find a way to update CUDA or downgrade JAX properly.
>
> If someone could help, we would all be thankful.

Couldn't we just downgrade the JAX in terminal? If we have colab pro.

You don't need the colab pro terminal for that. Just need the right command.

> Just need the right command.

Yeah I just noticed that, the command below is not working:

pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

The program is still not working

I tried several LoRA training notebooks on Colab, and they all had the same problem with CUDA...
If anyone could figure it out, it would be a really great thing. 😣

A friend said it started to work after running this command:

!pip install --upgrade bitsandbytes

Haven't tried it myself, but I'll share anyway.

> A friend said it started to work after running this command:
>
> !pip install --upgrade bitsandbytes
>
> Haven't tried it myself, but I'll share anyway.

It works! Thanks a lot for sharing!

> A friend said it started to work after running this command:
> !pip install --upgrade bitsandbytes
> Haven't tried it myself, but I'll share anyway.

> It works! Thanks a lot for sharing!

Where to put the command?

> A friend said it started to work after running this command:
> !pip install --upgrade bitsandbytes
> Haven't tried it myself, but I'll share anyway.

> It works! Thanks a lot for sharing!

> Where to put the command?

I put it at the bottom of the install dependencies function, as attached:

[screenshot]

I can confirm that the suggested addition works as described:

Installing collected packages: bitsandbytes
Attempting uninstall: bitsandbytes
Found existing installation: bitsandbytes 0.35.0
Uninstalling bitsandbytes-0.35.0:
Successfully uninstalled bitsandbytes-0.35.0
Successfully installed bitsandbytes-0.41.3.post2

✅ Installation finished in 148 seconds.
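To confirm the upgrade actually took effect before relaunching training, one can compare the installed version against a floor. A rough self-contained check (the helper is mine and deliberately simple; it ignores suffixes like "post2"):

```python
import re

def version_at_least(installed: str, required: str) -> bool:
    """Numeric compare of dotted versions, ignoring suffixes like 'post2'."""
    def nums(v: str) -> tuple:
        # Take the first three numeric components: "0.41.3.post2" -> (0, 41, 3)
        return tuple(int(x) for x in re.findall(r"\d+", v)[:3])
    return nums(installed) >= nums(required)
```

For example, `version_at_least(bitsandbytes.__version__, "0.41.0")` in a Colab cell; for real projects, `packaging.version.parse` is the more robust option.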

It actually worked!

[screenshot]

Thank you!

> A friend said it started to work after running this command:
>
> !pip install --upgrade bitsandbytes
>
> Haven't tried it myself, but I'll share anyway.

Thank you lots. I have added the upgraded bitsandbytes version to the requirements. The trainer is working again; no changes are needed on your end as of right now.
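For anyone maintaining a fork of the notebook, the equivalent requirements change is a floor pin along these lines (the exact specifier in the repo may differ):

```
bitsandbytes>=0.41.3
```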

I'm still not getting it, what can I do?

> I'm still not getting it, what can I do?

No need to do anything anymore; just open the LoRA training Colab notebook from the link again, as hollowstrawberry has updated it.