AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Xiaolong-RRL opened this issue · 5 comments
Xiaolong-RRL commented
Dear author:
Thanks for your interesting work.
When I run 3D model training with 'sh scripts/train_lamm3d.sh' after installation, the following error occurs:
Using /data/x/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /data/x/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
1.11.1.git.kitware.jobserver-1
Loading extension module cpu_adam...
Traceback (most recent call last):
File "/data/x/code/lamm/src/train.py", line 255, in <module>
main(**cfg)
File "/data/x/code/lamm/src/train.py", line 226, in main
agent = load_model(args)
File "/data/x/code/lamm/src/model/__init__.py", line 9, in load_model
agent = globals()[agent_name](model, args)
File "/data/x/code/lamm/src/model/agent.py", line 22, in __init__
self.ds_engine, self.optimizer, _, _ = deepspeed.initialize(
File "/data/x/miniconda3/envs/lamm/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/data/x/miniconda3/envs/lamm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/data/x/miniconda3/envs/lamm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1174, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/data/x/miniconda3/envs/lamm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
File "/data/x/miniconda3/envs/lamm/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
File "/data/x/miniconda3/envs/lamm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
return self.jit_load(verbose)
File "/data/x/miniconda3/envs/lamm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
op_module = load(name=self.name,
File "/data/x/miniconda3/envs/lamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/data/x/miniconda3/envs/lamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/data/x/miniconda3/envs/lamm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 571, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1176, in create_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /data/x/.cache/torch_extensions/py310_cu117/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7ff049ef4ee0>
Traceback (most recent call last):
File "/data/x/miniconda3/envs/lamm/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
It seems that the cpu_adam extension did not compile successfully. Here is my conda environment:
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 2.0.0 pypi_0 pypi
accelerate 0.23.0 pypi_0 pypi
asttokens 2.4.0 pypi_0 pypi
av 10.0.0 pypi_0 pypi
backcall 0.2.0 pypi_0 pypi
bigmodelvis 0.0.1 pypi_0 pypi
bzip2 1.0.8 h7b6447c_0
ca-certificates 2023.08.22 h06a4308_0
cachetools 5.3.1 pypi_0 pypi
certifi 2023.7.22 pypi_0 pypi
charset-normalizer 3.3.0 pypi_0 pypi
click 8.1.7 pypi_0 pypi
cmake 3.27.7 pypi_0 pypi
cython 3.0.4 pypi_0 pypi
data 0.4 pypi_0 pypi
decorator 5.1.1 pypi_0 pypi
decord 0.6.0 pypi_0 pypi
deepspeed 0.9.3 pypi_0 pypi
einops 0.7.0 pypi_0 pypi
exceptiongroup 1.1.3 pypi_0 pypi
executing 2.0.0 pypi_0 pypi
filelock 3.12.4 pypi_0 pypi
fsspec 2023.9.2 pypi_0 pypi
ftfy 6.1.1 pypi_0 pypi
funcsigs 1.0.2 pypi_0 pypi
fvcore 0.1.5.post20221221 pypi_0 pypi
google-auth 2.23.3 pypi_0 pypi
google-auth-oauthlib 1.1.0 pypi_0 pypi
grpcio 1.59.0 pypi_0 pypi
hjson 3.1.0 pypi_0 pypi
huggingface-hub 0.17.3 pypi_0 pypi
idna 3.4 pypi_0 pypi
iopath 0.1.10 pypi_0 pypi
ipdb 0.13.13 pypi_0 pypi
ipython 8.16.1 pypi_0 pypi
jedi 0.19.1 pypi_0 pypi
joblib 1.3.2 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libffi 3.4.4 h6a678d5_0
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
markdown 3.5 pypi_0 pypi
markdown-it-py 3.0.0 pypi_0 pypi
markupsafe 2.1.3 pypi_0 pypi
matplotlib-inline 0.1.6 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
ncurses 6.4 h6a678d5_0
networkx 3.2 pypi_0 pypi
ninja 1.11.1 pypi_0 pypi
nltk 3.8.1 pypi_0 pypi
numpy 1.26.1 pypi_0 pypi
oauthlib 3.2.2 pypi_0 pypi
openssl 3.0.11 h7f8727e_2
packaging 23.2 pypi_0 pypi
parameterized 0.9.0 pypi_0 pypi
parso 0.8.3 pypi_0 pypi
peft 0.3.0 pypi_0 pypi
pexpect 4.8.0 pypi_0 pypi
pickleshare 0.7.5 pypi_0 pypi
pillow 9.5.0 pypi_0 pypi
pip 23.3 py310h06a4308_0
plumbum 1.8.2 pypi_0 pypi
plyfile 1.0.1 pypi_0 pypi
pointnet2 0.0.0 pypi_0 pypi
portalocker 2.8.2 pypi_0 pypi
prompt-toolkit 3.0.39 pypi_0 pypi
protobuf 4.23.4 pypi_0 pypi
psutil 5.9.6 pypi_0 pypi
ptyprocess 0.7.0 pypi_0 pypi
pure-eval 0.2.2 pypi_0 pypi
py-cpuinfo 9.0.0 pypi_0 pypi
pyasn1 0.5.0 pypi_0 pypi
pyasn1-modules 0.3.0 pypi_0 pypi
pydantic 1.10.13 pypi_0 pypi
pygments 2.16.1 pypi_0 pypi
python 3.10.13 h955ad1f_0
pytorchvideo 0.1.5 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
pyzmq 25.1.1 pypi_0 pypi
readline 8.2 h5eee18b_0
regex 2022.10.31 pypi_0 pypi
requests 2.31.0 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
rich 13.6.0 pypi_0 pypi
rpyc 5.3.1 pypi_0 pypi
rsa 4.9 pypi_0 pypi
safetensors 0.4.0 pypi_0 pypi
sentencepiece 0.1.99 pypi_0 pypi
setuptools 65.5.1 pypi_0 pypi
six 1.16.0 pypi_0 pypi
sqlite 3.41.2 h5eee18b_0
stack-data 0.6.3 pypi_0 pypi
tabulate 0.9.0 pypi_0 pypi
tensorboard 2.15.0 pypi_0 pypi
tensorboard-data-server 0.7.1 pypi_0 pypi
termcolor 2.3.0 pypi_0 pypi
timm 0.6.7 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
tokenizers 0.14.1 pypi_0 pypi
tomli 2.0.1 pypi_0 pypi
torch 1.13.1+cu117 pypi_0 pypi
torchaudio 0.13.1+cu117 pypi_0 pypi
torchvision 0.14.1+cu117 pypi_0 pypi
tqdm 4.66.1 pypi_0 pypi
traitlets 5.11.2 pypi_0 pypi
transformers 4.34.1 pypi_0 pypi
trimesh 4.0.0 pypi_0 pypi
triton 2.0.0.dev20221202 pypi_0 pypi
typing-extensions 4.8.0 pypi_0 pypi
tzdata 2023c h04d1e81_0
urllib3 2.0.7 pypi_0 pypi
uvloop 0.18.0 pypi_0 pypi
wcwidth 0.2.8 pypi_0 pypi
werkzeug 3.0.0 pypi_0 pypi
wheel 0.41.2 py310h06a4308_0
xz 5.4.2 h5eee18b_0
yacs 0.1.8 pypi_0 pypi
zlib 1.2.13 h5eee18b_0
I wonder whether you have encountered similar problems, and if so, how you solved them.
Best!
Xiaolong
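The ImportError says cpu_adam.so is missing from the build directory, and the log shows no compiler output between "Building extension module cpu_adam..." and "Loading extension module cpu_adam...". That may indicate the build step never produced the library. One common first step, offered here as an assumption rather than as the fix referenced later in this thread, is to check whether that JIT build directory is stale and delete it so PyTorch rebuilds the extension from scratch. The sketch below takes the extensions root printed in the log ("Using ... as PyTorch extensions root"). The function name check_extension_build is hypothetical.

```python
import shutil
from pathlib import Path


def check_extension_build(root, name="cpu_adam", remove_if_broken=False):
    """Report the state of a JIT-built torch extension under `root` (hypothetical helper).

    Returns "missing" if no build directory exists, "ok" if <name>.so is present,
    and "broken" if the directory exists but the shared object does not.
    """
    build_dir = Path(root) / name
    if not build_dir.is_dir():
        return "missing"
    if (build_dir / f"{name}.so").is_file():
        return "ok"
    if remove_if_broken:
        # Deleting the whole directory (build.ninja, object files, lock file)
        # means the next load() has to start a clean build.
        shutil.rmtree(build_dir)
    return "broken"
```

For this report the call would be check_extension_build("/data/x/.cache/torch_extensions/py310_cu117", remove_if_broken=True). Rerunning sh scripts/train_lamm3d.sh afterwards should trigger a fresh build, and any compiler errors should then appear in the log.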
Xiaolong-RRL commented
The error was solved by following the issue. (By the way, the error occurred on a 4090.)
bdytx5 commented
If you need a quick fix, disable optimizer CPU offload.
isjakewong commented
(Quoting bdytx5: "If you need a quick fix, disable optimizer CPU offload.")
How can we disable it?
bdytx5 commented
I think it should just be left out of the config JSON, or possibly you need to switch to ZeRO stage 2. I forget which one.
bdytx5 commented
That is, the config should not specify CPU offloading for the optimizer at all.
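bdytx5's suggestion lines up with the traceback. DeepSpeedCPUAdam is constructed in _configure_basic_optimizer (engine.py line 1230), and DeepSpeed presumably takes that path because the config requests optimizer offload. Without it, DeepSpeed should pick its GPU Adam path instead of cpu_adam. The sketch below implements the "leave it out of the config JSON" option. It removes zero_optimization.offload_optimizer from a copy of the config, along with the older cpu_offload boolean that earlier DeepSpeed configs used for the same setting. The helper names and file paths are hypothetical; this thread does not say which JSON file LAMM actually loads. The trade-off is that optimizer states then stay in GPU memory, so memory use goes up.

```python
import copy
import json


def strip_optimizer_offload(ds_config):
    """Return a copy of a DeepSpeed config dict without optimizer CPU offload.

    Assumption: offload is configured under the standard "zero_optimization" section.
    """
    cfg = copy.deepcopy(ds_config)  # leave the caller's dict untouched
    zero = cfg.get("zero_optimization", {})
    zero.pop("offload_optimizer", None)
    zero.pop("cpu_offload", None)  # legacy boolean form of the same setting
    return cfg


def rewrite_config_file(src_path, dst_path):
    """Hypothetical helper: load a JSON DeepSpeed config, strip offload, write a new file."""
    with open(src_path) as f:
        cfg = json.load(f)
    with open(dst_path, "w") as f:
        json.dump(strip_optimizer_offload(cfg), f, indent=2)
```

The ZeRO stage is left as-is. Whether switching stages alone is enough, as bdytx5 also wondered, is not settled in this thread.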