v143 - GPU Latest: the CUDA upgrade in the latest version seems to have caused many DeepSpeed errors
lh0x00 opened this issue · 0 comments
lh0x00 commented
🐛 Bug
The CUDA upgrade in the latest version seems to have caused many errors related to DeepSpeed.
How can I continue using CUDA 11.8 as before? The Environment setting seems to force the notebook onto the latest version, with no way to choose a previous one.
For context, the system reports an error similar to marcoslucianops/DeepStream-Yolo#229:
crt/host_defines.h"
| ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/kaggle/working/alignment-handbook/scripts/run_sft_unsloth.py", line 287, in <module>
    main()
  File "/kaggle/working/alignment-handbook/scripts/run_sft_unsloth.py", line 235, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 323, in train
    output = super().train(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1544, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1704, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1280, in prepare
    result = self._prepare_deepspeed(*args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1662, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1193, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1264, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 452, in load
    return self.jit_load(verbose)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 501, in jit_load
    op_module = load(name=self.name,
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Loading extension module cpu_adam...
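
To help narrow down a failure like the one above, here is a minimal diagnostic sketch that prints the CUDA release reported by `nvcc` next to the CUDA version PyTorch was built against. It only reports versions and does not fix the build. The helper names (`parse_nvcc_release`, `installed_nvcc_release`, `torch_cuda_release`) are hypothetical and not part of any library, and the sketch assumes that `nvcc`, if installed, is on `PATH`; a mismatch between the two values is one possible cause of JIT extension build failures, not a confirmed diagnosis of this issue.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'major.minor' release from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_nvcc_release() -> Optional[str]:
    """Return the release of the nvcc found on PATH, or None if there is none."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def torch_cuda_release() -> Optional[str]:
    """Return the CUDA version PyTorch was built with, or None if unavailable."""
    try:
        import torch  # imported lazily so the script still runs without PyTorch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    print("nvcc release:      ", installed_nvcc_release())
    print("torch CUDA release:", torch_cuda_release())
```

Running this in a notebook cell on both the old and the new image would show whether the toolkit used by the JIT build changed between them.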