Docker shared memory issue and solution
peteflorence opened this issue · 15 comments
I am not sure if this is happening in our various other configurations, but it was happening in my spartan Docker container, inside which I put PyTorch and was trying to do some training.
Symptom
I was getting an error along the lines of "Bus error (core dumped) model share memory". It's related to this issue: pytorch/pytorch#2244
Cause
The comments by apaszke (a PyTorch author) are helpful here (pytorch/pytorch#1355 (comment)). Running inside the Docker container, it appears the only available shared memory is 64 MB:
peteflo@08482dc37efa:~$ df -h | grep shm
shm 64M 0 64M 0% /dev/shm
Temp Solution
As mentioned by apaszke,
sudo mount -o remount,size=8G /dev/shm
(choose more than 8G if you'd like)
This fixes it, as visible here:
peteflo@08482dc37efa:~$ df -h | grep shm
shm 8.0G 0 8.0G 0% /dev/shm
Other notes
In some places on the internet you will find that --ipc=host is supposed to avoid this issue, as can other flags passed to docker run, but those didn't work for me, and they require re-creating the container. I suspect something about my configuration is wrong. The remount above fixes it even while inside the container.
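For reference, the two flags do different things: --ipc=host makes the container share the host's IPC namespace, so /dev/shm inside the container is the host's /dev/shm, while --shm-size only sets the size of the container's own private /dev/shm. Either flag has to be supplied when the container is created, roughly like this (the image name here is just a placeholder):
docker run -it --ipc=host my_image /bin/bash
docker run -it --shm-size=8G my_image /bin/bash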
Long term solution
It would first be useful to identify whether anybody else's docker containers have this issue, which can be checked simply by running df -h | grep shm inside the container. Then we could diagnose who it is happening to and why. It might just be me.
Yes, that would work, but first I would like to ascertain whether anybody else has this issue.
I've done a lot of work with PyTorch in Docker before but haven't had this, so I would like to understand what's different.
It's easy to test your own Docker setup; just run:
df -h | grep shm
why not use: docker run --shm-size 8G
Yeah I have it inside my spartan container as well.
manuelli@paladin-44:~/spartan$ df -h | grep shm
shm 64M 0 64M 0% /dev/shm
but inside the pdc container I have 31G.
manuelli@paladin-44:~/code$ df -h | grep shm
tmpfs 32G 882M 31G 3% /dev/shm
So we must have something different between the pdc and spartan docker containers that is causing this.
Resolved by either passing --ipc=host or --shm-size 8G. I did have the arg in the wrong spot in the command string that docker_run.py builds up!
Looked at it with @manuelli this morning.
We might just want to add --ipc=host by default to spartan.
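For reference, here is a rough, hypothetical sketch of what getting the flag placement right in a docker_run.py-style script could look like; the variable names, paths, and image below are placeholders, not spartan's actual values:

# Hypothetical sketch only; not spartan's actual docker_run.py.
image_name = "spartan-image"            # placeholder image name
host_dir = "/home/user/spartan"         # placeholder mount source
container_dir = "/home/user/spartan"    # placeholder mount target

cmd = "docker run -it"
cmd += " --ipc=host"                    # or " --shm-size=8G" to enlarge the container's private /dev/shm
cmd += " -v %s:%s" % (host_dir, container_dir)
# ...any other docker run options go here...
cmd += " " + image_name                 # all options must come before the image name
cmd += " /bin/bash"                     # command to run inside the container

print(cmd)                              # or hand the string to os.system / subprocess to launch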
@peteflorence If both ipc=host and shm-size work for increasing shared memory, could you help me understand the difference?
Both solutions worked for me (though in a separate container that runs PyTorch). Is the root cause still unknown? Otherwise perhaps this issue is resolved.
Is there a way to override the path used by PyTorch multiprocessing (/dev/shm)? Unfortunately, increasing shared memory is not possible for me.
Something like %env JOBLIB_TEMP_FOLDER=/tmp, which works for sklearn.
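I don't know of a supported way to point PyTorch's shared memory somewhere else, but if /dev/shm cannot be enlarged, one workaround (assuming the pressure comes from DataLoader worker processes handing batches between processes) is to load data in the main process. A minimal sketch with a toy dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset purely for illustration.
dataset = TensorDataset(torch.randn(1000, 3), torch.randint(0, 2, (1000,)))

# With num_workers=0, batches are produced in the main process, so they are
# not passed between processes through /dev/shm.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

for inputs, labels in loader:
    pass  # training step would go here

The trade-off is that data loading no longer overlaps with training, so it can be slower.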