drprojects/superpoint_transformer

RuntimeError: DataLoader worker (pid 15163) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit

gardiens opened this issue · 8 comments

Hello,
Thank you for sharing your amazing repo.
I have a question:
When you trained your model, especially on the 11 GB GPU, how much shared memory did you have?
After struggling with preprocessing, I now have a lot of issues with shared memory errors :(
My error is this one:
RuntimeError: DataLoader worker (pid 15163) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit
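
For reference, this is how I check how much shared memory is actually available (a minimal sketch, assuming a Linux host; in a Docker container the limit comes from the --shm-size flag):

```python
# Minimal check of the shared memory (/dev/shm) backing the DataLoader workers.
# Assumes a Linux host; in Docker the limit is set with --shm-size at container start.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 1e9:.1f} GB free out of {total / 1e9:.1f} GB")
```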

Hi, the config should normally work for an 11G device, even if tight. Did you make sure you had no other process running on your GPU? In particular, are you perhaps using a desktop computer's GPU? If so, you do not want any graphical user interface taking space on it, so make sure you are running in a GUI-free session. You do need all 11G to be 100% available for the Python process.

I think I am the only one using the GPU, and I still run out of memory.
However, I'm using an old version of PyTorch (11.3) and I suspect that memory may leak when the dataset items are dicts or lists (pytorch/pytorch#13246 (comment)), so I will try whether using numpy arrays or pandas helps, or try other hyperparameters.
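
As a rough sketch of the workaround discussed in that thread (the class, paths, and loading code below are purely illustrative, not this repo's actual dataset), the idea is to hold worker-shared metadata in numpy arrays rather than Python lists:

```python
# Storing the file list as a numpy bytes array avoids the per-element Python
# objects whose refcounts get touched by every worker, which is what triggers
# the copy-on-write memory growth described in pytorch/pytorch#13246.
import numpy as np
import torch
from torch.utils.data import Dataset


class CloudDataset(Dataset):
    def __init__(self, paths):
        # numpy array of fixed-width bytes instead of a Python list of str
        self.paths = np.array(paths, dtype=np.bytes_)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx].decode()
        # load and return the point cloud stored at `path` here;
        # the placeholder tensor just keeps the sketch runnable
        return torch.zeros(3), path
```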

Before going into any of these modifications, did you make sure with nvidia-smi that there is no other process taking memory?
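
A quick way to check this from Python, equivalent to glancing at nvidia-smi (assuming a recent PyTorch with torch.cuda.mem_get_info and a visible CUDA device):

```python
# Device-wide free/total memory, as reported by the CUDA driver,
# so it accounts for any other process using the GPU.
import torch

free, total = torch.cuda.mem_get_info()  # bytes on the current device
print(f"GPU memory free: {free / 1e9:.1f} GB / {total / 1e9:.1f} GB")
```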

If so, before you go into tricky dataloader memory optimization, I would rather recommend changing some of the dataset, model, or trainer parameters. Some suggestions can be found in the README.

Depending on which dataset you are using and when the error occurs, I could suggest other tricks for lowering memory usage.
NB: make sure you set CUDA_LAUNCH_BLOCKING=1 to access the proper traceback when debugging CUDA errors 😉
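
For instance, something along these lines before any CUDA work starts (or export the variable in your shell before launching the run):

```python
# Force synchronous CUDA kernel launches so the traceback points at the
# operation that actually failed. Must be set before the first CUDA call.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the variable
```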

OK, I found a way to solve it.
If you are experiencing this kind of error during training:

RuntimeError: DataLoader worker (pid 15163) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

or
multiprocessing.pool.MaybeEncodingError: Error sending result: '[Data(y=[2121526], pos=[2121526, 3], rgb=[2121526, 3], is_val=[2121526], pos_room=[2121526, 3])]'. Reason: 'RuntimeError('Cannot allocate memory')'

These are mostly a RAM issue.
You can set the in_memory value of the dataset (here S3DIS) to False. It slows training down a bit, but shared memory no longer goes out of control during the first epoch (around 1 minute per epoch for me).
I don't know if it will solve the issue at the preprocessing step, which consumes too much RAM for me, but it made training runnable.
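
For intuition, in_memory=False roughly amounts to the following trade-off (the class and file layout below are illustrative, not the actual repo code): instead of caching every preprocessed cloud in RAM, each item is read back from disk on demand.

```python
# Illustrative sketch only: the real S3DIS dataset class differs, but the
# in_memory flag boils down to this speed-vs-RAM trade-off.
import os
import torch
from torch.utils.data import Dataset


class DiskBackedClouds(Dataset):
    def __init__(self, processed_dir, in_memory=False):
        self.files = sorted(
            os.path.join(processed_dir, f) for f in os.listdir(processed_dir)
        )
        # cache every preprocessed cloud in RAM only when explicitly asked to
        self.cache = [torch.load(f) for f in self.files] if in_memory else None

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        if self.cache is not None:
            return self.cache[idx]          # fast path, RAM-hungry
        return torch.load(self.files[idx])  # slower, but RAM usage stays flat
```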

Oh I see, I thought you were encountering a CUDA OOM error, but it is actually your CPU memory that is limited. Indeed, setting in_memory=False for the S3DIS dataset will save some RAM, at the expense of training speed. If increasing the CPU RAM on your machine is a possibility, I would recommend doing so. For reference, all my machines have at least 64G of RAM; the project has not been tested with less.

Well, with 45 GB of RAM on average it is not enough, but I thought it would have been given the idea of "superpoints" :(
Thank you anyway for your quick reply!

For the record, superpoints do save a lot of GPU memory when computing high-level features (compared to voxel-based or point-based methods).

In our case, large CPU memory is needed to load large point clouds and prepare the corresponding batches to be fed to the GPU. It is not that superpoints do not save memory; rather, they make it possible to process much larger batches than other methods. But you can change that.

Put differently, you can reduce CPU memory usage by setting in_memory=False and by playing with the memory-affecting parameters indicated here to reduce the number of points in a batch.
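
As an illustration of that second point (the cap and function below are made up; the real knobs are the dataset/datamodule parameters from the configs the README points to), capping the number of points kept per cloud, together with a smaller batch size, directly shrinks what each DataLoader worker has to hold:

```python
# Illustrative only: randomly cap the number of points sampled per cloud
# before batching. `max_points` is a hypothetical stand-in for the project's
# own sampling parameters.
import torch


def subsample_cloud(points: torch.Tensor, max_points: int = 100_000) -> torch.Tensor:
    """Randomly keep at most `max_points` rows of an (N, D) point cloud."""
    if points.shape[0] <= max_points:
        return points
    idx = torch.randperm(points.shape[0])[:max_points]
    return points[idx]
```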

Best,
Damien