georghess/neurad-studio

Bus error (core dumped)!

Closed this issue · 12 comments

Describe the bug
Has anybody run into this Bus error? :(

/root/miniconda3/envs/neurad/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 27 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Bus error (core dumped)

To Reproduce
Steps to reproduce the behavior:
Run the code:

python nerfstudio/scripts/train.py neurad pandaset-data --data <pandaset-path>

I thought it was an environment problem, but it's not. I tried every way of setting up the environment and checked the setup process in detail, and the error still happened! /cry

Hi! This could potentially be a sign of running out of memory. How large is the VRAM of your system?
Side note, multi-processing doesn't always give the nicest tracebacks. You could try to run with --pipeline.datamanager.num_processes=0 to disable multi-processing.
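
For example, the flag goes right after the method name (full command as shown further down in this thread):

python nerfstudio/scripts/train.py neurad --pipeline.datamanager.num_processes=0 pandaset-data --data <pandaset-path>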

Good idea! I'm using the default Docker build settings, so there may only be a little VRAM available to use. I'll try to expand it and test again. Thanks!!!

You can always adjust the batch size and/or model size if it turns out that your machine can't handle our default settings. Have a look at our debug configurations for some guidance on how to run a smaller model: https://github.com/georghess/neurad-studio/blob/main/.vscode/launch.json#L51

It works!!!! Thanks! The problem was the limited VRAM: I expanded it to 128G and the Bus error was fixed. The smaller model setting is sweet and should be helpful for anyone whose machine is limited. Thanks again!
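
For anyone else hitting this inside Docker: one possible way to give the container more memory and shared memory, assuming a plain docker run workflow (the image name below is just a placeholder), is something like:

# raise the container's RAM and /dev/shm limits (values are just an example)
docker run --gpus all --memory=128g --shm-size=128g -it <neurad-image>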

Hey, I got the same error when running it in a Docker container. What exactly did you do to solve it? My machine has 250 GB of memory, of which 237 GB is available. Thanks

Hi! This could potentially be a sign of running out of memory. How large is the VRAM of your system? Side note, multi-processing doesn't always give the nicest tracebacks. You could try to run with --pipeline.datamanager.num_processes=0 to disable multi-processing.

Hey, the VRAM on my machine is (from glxinfo | egrep -i 'device|memory'):
Memory info (GL_NVX_gpu_memory_info):
Dedicated video memory: 24576 MB
Total available memory: 24576 MB
Currently available dedicated video memory: 22128 MB

I'm running it inside a Docker container on an RTX 3090 with 250 GB of memory, of which 237 GB is available.

Hi. Could you try to run with --pipeline.datamanager.num_processes=0 and attach the log with the error you get?

Unrecognized or misplaced options: --pipeline.datamanager.num-processes
Perhaps you meant:
  --pipeline.datamanager.num-processes INT
    Number of processes to use for train data loading. More than 1 doesn't result in that much better performance (default: 6)
    in train.py neurad --help
    Number of processes to use for train data loading. More than 1 doesn't result in that much better performance (default: 1)
    in train.py nerfacto --help
    in train.py nerfacto-big --help
    in train.py nerfacto-huge --help
  [...]
For full helptext, run train.py --help

When I set the flag to 0 and try running, it doesn't recognize the flag.

Maybe it's num_processes=0 (with an underscore, "_")? I fixed this error by expanding the VRAM to 128G.

@amoghskanda Then you have most likely added the argument in the wrong place. It is an argument to the pipeline, so it should be added after the method name, not after the dataset name. For example:
python nerfstudio/scripts/train.py neurad --pipeline.datamanager.num_processes=0 pandaset-data --data data/pandaset

Note that the number of processes used to load the data is most likely not the issue here, but running without multi-processing (i.e. by setting num_processes=0) usually gives a nicer error traceback.

Hey, thanks for this. It worked, and I'm able to train neurad inside a Docker container.