zjunlp/OntoProtein

run_pretrain.sh raises an error

Seyfried97 opened this issue · 4 comments

I set up a deepspeed environment and ran run_pretrain.sh, but got the following error:

File "run_pretrain.py", line 135, in <module>
main()
File "run_pretrain.py", line 131, in main
trainer.train()
File "OntoProtein/src/trainer.py", line 167, in train
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 405, in deepspeed_init
hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 267, in trainer_config_finalize
hidden_size = model.config.hidden_size
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'OntoProteinPreTrainedModel' object has no attribute 'config'

I then pointed the config attribute at protein_model_config and ran the commented-out part of training_arg.py, which produced the following error:

File "run_pretrain.py", line 135, in <module>
main()
File "run_pretrain.py", line 131, in main
trainer.train()
File "OntoProtein/src/trainer.py", line 167, in train
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/transformers/deepspeed.py", line 437, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/__init__.py", line 120, in initialize
engine = DeepSpeedEngine(args=args,
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 239, in __init__
self._configure_with_arguments(args, mpu)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 872, in _configure_with_arguments
self._config = DeepSpeedConfig(self.config, mpu)
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 875, in __init__
self._configure_train_batch_size()
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 1051, in _configure_train_batch_size
self._batch_assertion()
File "anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/config.py", line 987, in _batch_assertion
train_batch > 0
TypeError: '>' not supported between instances of 'str' and 'int'

What is causing this?

Hello,

This happens because the deepspeed module in the current Hugging Face transformers differs from the version we used.
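
The TypeError in the second traceback shows that train_batch reached DeepSpeed as a string rather than an integer. One plausible cause, which this thread does not confirm, is a placeholder such as "auto" in the DeepSpeed config that the integration layer never resolved to a number. A minimal sketch for spotting such values before calling deepspeed.initialize is below; the function name find_string_batch_values and the choice of which keys to inspect are illustrative assumptions, not part of OntoProtein or transformers.

```python
from typing import Any, Dict, List

# Batch-related DeepSpeed config keys that must be numeric by the time
# DeepSpeed validates them. The selection here is for illustration only.
BATCH_KEYS = (
    "train_batch_size",
    "train_micro_batch_size_per_gpu",
    "gradient_accumulation_steps",
)


def find_string_batch_values(ds_config: Dict[str, Any]) -> List[str]:
    """Return the batch-related keys whose value is still a string.

    Hypothetical helper: a string here (for example "auto") would make a
    comparison like `train_batch > 0` raise a TypeError.
    """
    return [key for key in BATCH_KEYS if isinstance(ds_config.get(key), str)]


if __name__ == "__main__":
    # Example config dict with one unresolved placeholder (made-up values).
    example = {"train_batch_size": "auto", "gradient_accumulation_steps": 1}
    print(find_string_batch_values(example))
```

If the list is non-empty, the config was passed to DeepSpeed before those entries were filled in, which would be consistent with a version mismatch between the trainer code and the installed transformers.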

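We have uploaded deepspeed.py, the deepspeed module source from the transformers version we used. You can replace transformers/deepspeed.py in your installed transformers source with the file we provide (back up the original first).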
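
The backup-then-replace step could be sketched as follows. This is only an illustration under stated assumptions: the helper names installed_deepspeed_py and replace_with_backup are hypothetical, and the path to the downloaded file is whatever location you saved it to.

```python
import importlib.util
import shutil
from pathlib import Path


def installed_deepspeed_py() -> Path:
    """Locate transformers/deepspeed.py in the active environment.

    Hypothetical helper; it finds the package directory without importing it.
    """
    spec = importlib.util.find_spec("transformers")
    if spec is None or not spec.submodule_search_locations:
        raise RuntimeError("transformers is not installed in this environment")
    return Path(list(spec.submodule_search_locations)[0]) / "deepspeed.py"


def replace_with_backup(target: Path, replacement: Path) -> Path:
    """Save `target` as `<name>.bak`, then overwrite it with `replacement`.

    Returns the backup path. An existing backup is left untouched so that a
    second run cannot overwrite the original file's only copy.
    """
    backup = target.with_name(target.name + ".bak")
    if not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(replacement, target)
    return backup
```

A call might look like replace_with_backup(installed_deepspeed_py(), Path("deepspeed.py")), where the second argument points at the downloaded file. Restoring the original is then a matter of copying the .bak file back.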

Thank you very much for your help; I will give it a try.

Could you tell me the exact transformers version you used?

transformers==4.9.2
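
To confirm that an environment actually has this version, a small check along these lines may help. It is a sketch only: the function name version_matches and the injectable lookup parameter are assumptions made for illustration, while importlib.metadata is part of the standard library from Python 3.8.

```python
from importlib import metadata
from typing import Callable

EXPECTED_TRANSFORMERS = "4.9.2"  # version stated in this thread


def version_matches(
    package: str,
    expected: str,
    lookup: Callable[[str], str] = metadata.version,
) -> bool:
    """Return True if the installed `package` version equals `expected`.

    `lookup` is injectable so the check can be exercised without the
    package being installed; a missing package counts as a mismatch.
    """
    try:
        return lookup(package) == expected
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    if version_matches("transformers", EXPECTED_TRANSFORMERS):
        print("transformers matches the pinned version")
    else:
        print("transformers is missing or at a different version")
```

If the check reports a mismatch, installing the pinned release with pip install transformers==4.9.2 in the same environment is the usual alternative to swapping the file by hand.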