[Zero123] GPU not fully used and training gets stuck
csyhping opened this issue · 3 comments
Hi @bennyguo, thanks for your excellent work. I tried to test Zero123/Stable Zero123, but the training always gets stuck. I've tried the suggestions in every related issue, running on one or two 3090s, different images from the example folder, etc., but the training still hangs and GPU memory usage stays very low (~5000MiB / 24GiB; with two GPUs, ~5000MiB / 24GiB on each).
I've seen other people report the same issue; do you have any idea what causes it? Thanks!!
I tried the multi-view-only example and it works, but the original 3D version still fails.
Finally solved...
For me, it was a `nerfacc` version issue; the code was stuck at `nerfacc.estimator.sampling`. Reinstalling with `pip install nerfacc==0.5.2` solved my problem. See: stuck at nerfacc utility #23
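Before reinstalling, it may help to confirm which `nerfacc` version is actually installed in the active environment. A minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration, and 0.5.2 is simply the version that worked in my case, not a confirmed requirement of the project:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("nerfacc")
    print(f"nerfacc version: {found}")
    # Assumption: 0.5.2 is the version that fixed the hang reported in this thread.
    if found != "0.5.2":
        print("Not 0.5.2; consider: pip install nerfacc==0.5.2")
```

Run it with the same Python interpreter that launches the training, otherwise it may report a different environment's packages.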
For anyone with a similar issue, try a debugging tool such as `ipdb` to find where the hang occurs; sometimes the GPU is not the cause, even though it looks like it.
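If stepping through with a debugger is inconvenient, one alternative (not what I used, just a sketch) is Python's built-in `faulthandler` module, which can dump the traceback of every thread if the process has not finished within a timeout. The helper names `watch_for_hang` and `stop_watching` and the 60-second default are assumptions for illustration:

```python
import faulthandler
import sys


def watch_for_hang(timeout_s: float = 60.0, file=sys.stderr):
    """Dump all thread tracebacks to `file` every `timeout_s` seconds until cancelled.

    `file` must stay open and have a real file descriptor (sys.stderr or an open file).
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def stop_watching():
    """Cancel the pending traceback dump, e.g. once training has finished."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    watch_for_hang(60.0)
    # ... start the training here (hypothetical placement); if it hangs,
    # the stack printed to stderr shows which Python call it is stuck in.
    stop_watching()
```

Calling `watch_for_hang` at the top of the launch script should be enough; in a case like mine, the dumped stack would be expected to point at the `nerfacc` call rather than at anything GPU-related.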
I hope this helps.