continuously growing memory
anonymoussss opened this issue · 0 comments
Hi, I am training DETR on the COCO dataset with the default training script:
python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco
However, after a few epochs of training, it always fails with the following error:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
... ...
RuntimeError: DataLoader worker (pid 8686) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
... ...
RuntimeError: DataLoader worker (pid(s) 8686) exited unexpectedly
I checked memory usage with free -h and found that it kept increasing during training until the crash. How can I solve this problem?
My machine has 256 GB of memory and 8 T4 GPUs. I run the training script in a Docker container with '--shm 256G', using CUDA 11.7, Python 3.8.5, torch 2.0.1, and torchvision 0.15.2.
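To tell whether it is host memory or /dev/shm that keeps growing, something like the script below could run alongside training instead of repeated free -h calls. It is only a rough sketch: the function names (parse_meminfo, snapshot, log_samples) are made up for illustration and are not part of DETR or PyTorch, and it assumes a Linux host where /proc/meminfo and /dev/shm exist.

```python
import shutil
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for sized fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def snapshot(meminfo_path="/proc/meminfo", shm_path="/dev/shm"):
    """Return used host memory and used shared memory, both in GiB.

    Assumes a Linux system; both default paths are assumptions about the container.
    """
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    used_kb = info["MemTotal"] - info["MemAvailable"]
    shm = shutil.disk_usage(shm_path)
    return {
        "mem_used_gib": used_kb / 2**20,   # kB -> GiB
        "shm_used_gib": shm.used / 2**30,  # bytes -> GiB
    }


def log_samples(n_samples, interval_s=60.0):
    """Print n_samples timestamped readings, one every interval_s seconds."""
    for i in range(n_samples):
        s = snapshot()
        stamp = time.strftime("%H:%M:%S")
        print(
            f"{stamp} mem_used={s['mem_used_gib']:.1f}GiB "
            f"shm_used={s['shm_used_gib']:.1f}GiB",
            flush=True,
        )
        if i + 1 < n_samples:
            time.sleep(interval_s)
```

Calling log_samples(600) in a second shell inside the container would give one reading per minute for ten hours. If shm_used stays flat while mem_used climbs, the growth is probably in process memory rather than in shared memory.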