onpix/LLNeRF

About train!

Closed this issue · 6 comments

Hi
I have a single V100 GPU with 32 GB of memory. With your default settings (batch_size=1024, ...), training is very slow: 100 steps take about half an hour, so finishing 100,000 steps would take about twenty days!
Why is it so slow? Can I change the batch_size to speed it up, or do other settings need to change? Can you give me some suggestions?
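For reference, the twenty-day figure follows directly from the numbers quoted above. A minimal sketch of the arithmetic (the function name `estimated_days` is only illustrative):

```python
def estimated_days(total_steps, steps_per_chunk, minutes_per_chunk):
    """Extrapolate total training time in days from a measured chunk of steps."""
    total_minutes = total_steps / steps_per_chunk * minutes_per_chunk
    return total_minutes / 60 / 24  # minutes -> hours -> days


# 100 steps take about 30 minutes; the target is 100,000 steps.
print(round(estimated_days(100_000, 100, 30), 1))  # prints 20.8
```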

onpix commented

Can you please check how long it takes to train multinerf with the default settings? If that also takes 20 days, there might be an issue with your environment or machine. As far as I remember, training our model or multinerf takes around 12 hours on a V100S or 8 hours on an A100.

I found that the code isn't training on the GPU, even though I allocated one GPU to it.
When I change the batch_size from 1024 to 1, it runs fast.
But in both cases the GPU is not used; the code runs on the CPU.
image
image
Do you know what the problem is?

onpix commented

Again, it's a good idea to try running multinerf. If you find that you cannot use the GPU while running multinerf either, it is likely due to your environment, such as an incorrect jaxlib version. But if multinerf runs well and our code does not, please let me know, as that is possibly caused by a bug on our side.
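One quick way to check whether JAX sees the GPU at all, independent of either codebase, is sketched below. This is only a minimal sketch and not part of LLNeRF or multinerf: the helper name `describe_jax_backend` is made up for illustration, while `jax.default_backend()` and `jax.devices()` are standard JAX calls.

```python
def describe_jax_backend():
    """Summarize which backend and devices JAX would use (hypothetical helper)."""
    try:
        import jax
    except ImportError:
        # JAX is not installed in this environment at all.
        return {"installed": False, "backend": None, "devices": []}
    return {
        "installed": True,
        "backend": jax.default_backend(),  # "cpu", "gpu" or "tpu"
        "devices": [str(d) for d in jax.devices()],
    }


info = describe_jax_backend()
print(info)
if info["installed"] and info["backend"] == "cpu":
    print("JAX is running on the CPU; the installed jaxlib may lack GPU support.")
```

If the reported backend is `cpu` on a machine with a GPU, the problem is most likely in the JAX/jaxlib installation rather than in the training code.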

image
Maybe it is the environment. How can I fix it?

I have fixed the problem above, but there is another problem:
image

onpix commented

It also looks like an environment problem. You may need to check the package dependencies.
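A minimal sketch for listing the installed versions so they can be compared against the project's requirements. The package list here is an assumption chosen for illustration (only jaxlib is named in this thread), and `installed_versions` is a hypothetical helper built on the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Assumed list of relevant packages; adjust it to the project's requirements file.
for name, ver in installed_versions(["jax", "jaxlib", "flax", "numpy"]).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```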