Error while running Finetuning script
suresh-pokharel opened this issue · 2 comments
I have been trying to run this fine-tuning script: https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/PT5_LoRA_Finetuning_per_residue_class.ipynb
But I am consistently getting this DeepSpeed-related error. Any help or leads would be appreciated.
[2024-02-23 15:10:12,270] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
Torch version: 2.2.1+cu121
Cuda version: 12.1
Numpy version: 1.26.4
Pandas version: 2.2.1
Transformers version: 4.38.1
Datasets version: 2.17.1
ProtT5_Classfier
Trainable Parameter: 1208144899
ProtT5_LoRA_Classfier
Trainable Parameter: 2510851
[2024-02-23 15:10:47,190] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-02-23 15:10:47,190] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
File "/home/sureshp/ProtTrans/Fine-Tuning/ft_script.py", line 738, in <module>
tokenizer, model, history = train_per_residue(my_train, my_valid, num_labels=3, batch=1, accum=1, epochs=1, seed=42, gpu=0)
File "/home/sureshp/ProtTrans/Fine-Tuning/ft_script.py", line 726, in train_per_residue
trainer.train()
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/transformers/trainer.py", line 1779, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/accelerate/accelerator.py", line 1220, in prepare
result = self._prepare_deepspeed(*args)
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/accelerate/accelerator.py", line 1605, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 176, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1231, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1302, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 478, in load
return self.jit_load(verbose)
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 498, in jit_load
extra_include_paths = [os.path.abspath(self.deepspeed_src_path(path)) for path in self.include_paths()]
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/ops/op_builder/cpu_adam.py", line 41, in include_paths
CUDA_INCLUDE = [os.path.join(torch.utils.cpp_extension.CUDA_HOME, "include")]
File "/home/sureshp/anaconda3/envs/ft-1/lib/python3.9/posixpath.py", line 76, in join
a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f8c59d88f70>
Traceback (most recent call last):
File "/home/sureshp/ProtTrans/Fine-Tuning/ft-venv/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
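The root cause is visible in the last frames: DeepSpeed's CPUAdam builder calls `os.path.join(torch.utils.cpp_extension.CUDA_HOME, "include")`, and `CUDA_HOME` is `None` when PyTorch cannot locate a CUDA toolkit (`nvcc`) on the machine, even though a CUDA runtime is available for inference. A minimal reproduction of that failing line:

```python
import os

# Reproduce the failure inside deepspeed's cpu_adam op builder:
# when torch.utils.cpp_extension.CUDA_HOME is None (no CUDA toolkit found),
# os.path.join(None, "include") raises exactly the TypeError seen above.
CUDA_HOME = None  # what torch reports without a locally installed toolkit
try:
    os.path.join(CUDA_HOME, "include")
except TypeError as e:
    print(f"TypeError: {e}")
```

You can confirm this on the failing machine by printing `torch.utils.cpp_extension.CUDA_HOME`; if it is `None`, DeepSpeed's JIT-compiled ops cannot build.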
I am facing the same issue.
Hi @suresh-pokharel, @abelavit
if you are not starved for GPU memory, I would suggest simply disabling DeepSpeed; it won't make much difference in terms of training performance. To do this, remove the environment variables set in cell 2 and set deepspeed=False in the train_per_residue() training call.
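The two steps above could look roughly like this. Note this is a sketch, not a tested patch: the environment-variable names below are the distributed-training ones commonly set for single-GPU DeepSpeed runs, and the `deepspeed=` keyword comes from the notebook's `train_per_residue()` signature; verify both against your copy of the notebook.

```python
import os

# 1) Remove the distributed-training environment variables set in cell 2
#    (names assumed; check the notebook for the exact list).
for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "LOCAL_RANK", "WORLD_SIZE"):
    os.environ.pop(var, None)  # pop with a default: no error if a var is unset

# 2) Disable DeepSpeed in the training call, e.g.:
# tokenizer, model, history = train_per_residue(
#     my_train, my_valid, num_labels=3, batch=1, accum=1,
#     epochs=1, seed=42, gpu=0, deepspeed=False)
```

With DeepSpeed disabled, the Trainer falls back to a plain PyTorch optimizer, so the CPUAdam JIT build (and its need for a local CUDA toolkit) is never triggered.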