sarafridov/K-Planes

About the memory needed for training cut_roasted_beef video with dynerf_hybrid config file

Closed this issue · 8 comments

Hi sarafridov,
Thanks for sharing the code of your great project.
I wonder how much memory is needed for training the cut_roasted_beef video with the dynerf_hybrid config file. My machine has 256G of memory, but the process fails every time due to lack of memory.

My best guess is that it's failing during the preprocessing step of computing the ray importance sampling weights. Here's what I recommend (a slightly more specific version of the suggestion in the readme):

  1. Edit the config file to use 4x downsampling instead of 2x, and change the number of training steps to 1 (a config sketch follows these steps). Run with this config, and it will compute and save the importance sampling weights at the lower resolution, which should fit in memory.
  2. Reset the config file back to default, and run it again. It will load and interpolate the downsampled importance sampling weights, rather than recomputing them, so it should fit in memory. In my experience 100G of CPU memory should be sufficient, following this two-step procedure.
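
For concreteness, the step 1 edit might look roughly like this. This is just a sketch; I'm assuming the relevant keys are named data_downsample and num_steps, so check the actual dynerf_hybrid.py for the exact names before editing:

```python
# Illustrative sketch only: key names are assumed, not copied from the repo.
config = {
    # ... leave all other settings unchanged ...
    "data_downsample": 4.0,  # step 1: was 2.0; computes the weights at lower resolution
    "num_steps": 1,          # step 1: just trigger the preprocessing, no real training
}
# Step 2: set data_downsample back to 2.0 and num_steps back to its original value,
# then rerun; the saved low-resolution weights are loaded and interpolated instead
# of being recomputed at full resolution.
```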

Thanks for your help. I tried what you suggested, but the process stopped at 'Loading test data' with no response or reported error.
[screenshot of the console output, stalled at 'Loading test data']

Yeah, that can happen sometimes; it's a concurrency bug in multi-threaded loading. It's annoying, but if you Ctrl+C and try again it should work. Let me know if it's still not loading.

I tried running main.py again about 8 times, and it failed with the same problem every time. Also, reading the training videos was successful but quite slow. I have no idea what to do. I really appreciate your generous help.

I'm not sure if this will fix the issue or not, but if it's related to multithreading you could try reducing the number of threads here https://github.com/sarafridov/K-Planes/blob/main/plenoxels/datasets/data_loading.py#L144
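
In case it helps, here's a minimal self-contained sketch of that kind of change; the function and variable names are illustrative rather than the repo's actual identifiers, and the only point is to cap the worker count handed to the pool:

```python
# Sketch of the parallel loading pattern with a reduced worker count.
# Fewer workers makes loading slower but can sidestep concurrency-related hangs.
from multiprocessing.pool import ThreadPool

from tqdm import tqdm


def load_one_frame(index: int) -> int:
    # Stand-in for the per-frame image loading work done in data_loading.py.
    return index


def parallel_load(num_frames: int, max_threads: int = 4) -> list:
    # Try lowering max_threads (the repo's default is higher) if loading hangs.
    with ThreadPool(processes=max_threads) as pool:
        iterator = pool.imap(load_one_frame, range(num_frames))
        return list(tqdm(iterator, total=num_frames))


if __name__ == "__main__":
    frames = parallel_load(num_frames=100, max_threads=4)
    print(len(frames))
```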

I tried that, but it's still not working. Is there another way to solve the problem?

Do you have this issue with other datasets (e.g. D-NeRF, or regular NeRF) or do those load properly? Also, are you getting this issue in the initial preprocessing run at downsample=4, or after that when you try to actually train at downsample=2?

That's really weird: after about 25 attempts, it finally worked. The issue occurred only when reading the test data. The iterator was created successfully by Pool.imap and the tqdm info was printed, yet the process stopped at 'out = next(iterator)', which makes no sense to me.
I'm sure it's not just slow data reading, since the longest I waited was a whole night.
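
For what it's worth, stopping there is consistent with how Pool.imap behaves: the call returns a lazy iterator right away, so a stall in the workers only becomes visible once next() blocks waiting for the first result. A tiny stand-alone demonstration (not the repo's code):

```python
# Demonstrates that imap returns immediately and next() is where waiting happens.
import time
from multiprocessing.pool import ThreadPool


def slow_task(i: int) -> int:
    time.sleep(2)  # pretend the worker is stuck or very slow
    return i


if __name__ == "__main__":
    with ThreadPool(processes=2) as pool:
        start = time.time()
        iterator = pool.imap(slow_task, range(4))
        print(f"imap returned after {time.time() - start:.3f}s")   # ~0 s
        out = next(iterator)  # blocks until the first worker finishes
        print(f"first result {out} after {time.time() - start:.3f}s")  # ~2 s
```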