Error while running Finetuning script
suresh-pokharel opened this issue · 2 comments
I have been trying to run this fine-tuning script: https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/PT5_LoRA_Finetuning_per_residue_class.ipynb
But I am consistently getting this DeepSpeed-related error. Any help or leads would be appreciated.
[2024-02-23 15:10:12,270] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
Torch version: 2.2.1+cu121
Cuda version: 12.1
Numpy version: 1.26.4
Pandas version: 2.2.1
Transformers version: 4.38.1
Datasets version: 2.17.1
ProtT5_Classfier
Trainable Parameter: 1208144899
ProtT5_LoRA_Classfier
Trainable Parameter: 2510851
[2024-02-23 15:10:47,190] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-02-23 15:10:47,190] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "/home/sureshp/ProtTrans/Fine-Tuning/ft_script.py", line 738, in <module>
tokenizer, model, history = train_per_residue(my_train, my_valid, num_labels=3, batch=1, accum=1, epochs=1, seed=42, gpu=0)
File "/home/sureshp/ProtTrans/Fine-Tuning/ft_script.py", line 726, in train_per_residue
trainer.train()
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/transformers/trainer.py", line 1779, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/accelerate/accelerator.py", line 1220, in prepare
result = self._prepare_deepspeed(*args)
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/accelerate/accelerator.py", line 1605, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 176, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1231, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1302, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 478, in load
return self.jit_load(verbose)
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 498, in jit_load
extra_include_paths = [os.path.abspath(self.deepspeed_src_path(path)) for path in self.include_paths()]
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/ops/op_builder/cpu_adam.py", line 41, in include_paths
CUDA_INCLUDE = [os.path.join(torch.utils.cpp_extension.CUDA_HOME, "include")]
File "/home/sureshp/anaconda3/envs/ft-1/lib/python3.9/posixpath.py", line 76, in join
a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f8c59d88f70>
Traceback (most recent call last):
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
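The root cause is visible in the last frames: DeepSpeed's CPUAdam builder calls `os.path.join(torch.utils.cpp_extension.CUDA_HOME, "include")`, and `CUDA_HOME` is `None` when PyTorch cannot locate a CUDA toolkit (`nvcc`) on the machine, even though a CUDA runtime is available for inference. A minimal reproduction of that failing line:

```python
import os

# Reproduce the failure inside deepspeed's cpu_adam op builder:
# when torch.utils.cpp_extension.CUDA_HOME is None (no CUDA toolkit found),
# os.path.join(None, "include") raises exactly the TypeError seen above.
CUDA_HOME = None  # what torch reports without a locally installed toolkit
try:
    os.path.join(CUDA_HOME, "include")
except TypeError as e:
    print(f"TypeError: {e}")
```

You can confirm this on the failing machine by printing `torch.utils.cpp_extension.CUDA_HOME`; if it is `None`, DeepSpeed's JIT-compiled ops cannot build.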
I am facing the same issue.
Hi @suresh-pokharel, @abelavit
if you are not starved for GPU memory, I would suggest simply disabling DeepSpeed; it won't make much difference in terms of training performance. To do this, remove the environment variables set in cell 2 and set deepspeed=False in the train_per_residue() training call.
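The two steps above could look roughly like this. Note this is a sketch, not a tested patch: the environment-variable names below are the distributed-training ones commonly set for single-GPU DeepSpeed runs, and the `deepspeed=` keyword comes from the notebook's `train_per_residue()` signature; verify both against your copy of the notebook.

```python
import os

# 1) Remove the distributed-training environment variables set in cell 2
#    (names assumed; check the notebook for the exact list).
for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE"):
    os.environ.pop(var, None)  # pop with a default: no error if a var is unset

# 2) Disable DeepSpeed in the training call, e.g.:
# tokenizer, model, history = train_per_residue(
#     my_train, my_valid, num_labels=3, batch=1, accum=1,
#     epochs=1, seed=42, gpu=0, deepspeed=False)
```

With DeepSpeed disabled, the Trainer falls back to a plain PyTorch optimizer, so the CPUAdam JIT build (and its need for a local CUDA toolkit) is never triggered.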