Kaggle/docker-python

v143 - GPU Latest: the latest update, which upgrades CUDA, seems to have caused many errors related to DeepSpeed

lh0x00 opened this issue · 0 comments

lh0x00 commented

🐛 Bug

The latest update, which upgrades CUDA, seems to have caused many errors related to DeepSpeed.
How can I continue using CUDA 11.8 as before? The Environment setting seems to force the Notebook onto the latest version, with no way to choose a previous one.

To explain further, the system reports an error similar to marcoslucianops/DeepStream-Yolo#229:

crt/host_defines.h"
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/kaggle/working/alignment-handbook/scripts/run_sft_unsloth.py", line 287, in <module>
    main()
  File "/kaggle/working/alignment-handbook/scripts/run_sft_unsloth.py", line 235, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 323, in train
    output = super().train(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1544, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1704, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1280, in prepare
    result = self._prepare_deepspeed(*args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1662, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 171, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1193, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1264, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 452, in load
    return self.jit_load(verbose)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 501, in jit_load
    op_module = load(name=self.name,
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Loading extension module cpu_adam...
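
For anyone trying to reproduce this, here is a minimal diagnostic sketch that compares the CUDA version PyTorch was built against with the release reported by the `nvcc` found on `PATH`. It rests on an assumption that the traceback does not confirm: that the `cpu_adam` JIT build fails because the toolkit in the image no longer matches what the extension build expects. The helper names (`parse_nvcc_release`, `toolkit_cuda_version`, `torch_cuda_version`, `same_major_minor`) are made up for illustration and are not part of DeepSpeed or PyTorch.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Return the 'major.minor' release found in `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", version_text)
    return match.group(1) if match else None


def toolkit_cuda_version():
    """Ask the nvcc on PATH for its release; None if nvcc is missing or fails."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    try:
        result = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


def torch_cuda_version():
    """CUDA version PyTorch was built against; None if torch is absent or CPU-only."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


def same_major_minor(a, b):
    """True only when both versions are known and agree on major.minor."""
    return a is not None and b is not None and a.split(".")[:2] == b.split(".")[:2]


if __name__ == "__main__":
    built = torch_cuda_version()
    toolkit = toolkit_cuda_version()
    print(f"torch built with CUDA: {built}")
    print(f"nvcc on PATH reports:  {toolkit}")
    print(f"major.minor match:     {same_major_minor(built, toolkit)}")
```

If the two versions differ, or `nvcc` is not found at all, that would be consistent with the missing `crt/host_defines.h` header above, but it does not prove the cause. DeepSpeed's own `ds_report` command prints similar environment details and may be a quicker first check.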