Docker fails due to lack of space: Segmentation Fault
DebayanChakraborty opened this issue · 3 comments
Hi,
This is the first time I am trying to run a project through Docker.
By default the container runs on the root partition. At run time, after the first epoch, the code fails with a segmentation fault. Can you instruct me on how to use this Docker container on another partition and run the process there? I have a 1 TB partition at my disposal; I am housing the project in this partition and also building the container there. However, on the run step I can see that the app, by default, gets mounted on the /usr partition.
Thanks in advance
Debayan
Hi @DebayanChakraborty,
when you start the docker container like this:
docker run -it -v "$PWD":/src bilstm bash
The $PWD is replaced with your current folder on the host system, i.e. the call looks like this:
docker run -it -v /home/debayan/emnlp2017-bilstm-cnn-crf:/src bilstm bash
The /src folder in the docker container then uses the same folder as /home/debayan/emnlp2017-bilstm-cnn-crf. If your home folder is on the 1 TB partition, then this partition is used. Otherwise you need to move the git repository to your 1 TB partition and start the docker container from there.
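For example, if the 1 TB partition were mounted at /mnt/data (the mount point is just an assumption, substitute your actual path), the steps could look like this:
mv ~/emnlp2017-bilstm-cnn-crf /mnt/data/
cd /mnt/data/emnlp2017-bilstm-cnn-crf
# $PWD now resolves to a folder on the 1 TB partition,
# so /src inside the container lives on that partition
docker run -it -v "$PWD":/src bilstm bash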
But I think the segmentation error does not stem from the partition.
Depending on the OS, the docker machine has only limited memory. Deep learning requires quite a lot of memory, so it is advisable to increase the memory limit for your docker machine.
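As a sketch (the 16 GB values below are examples, not tuned recommendations): the limit for a single container can be raised with the --memory flag, or, if you run Docker inside a VM via docker-machine, the VM itself can be created with more RAM:
# raise the memory limit for this container (value is an example)
docker run -it -m 16g -v "$PWD":/src bilstm bash
# or, when using docker-machine with the VirtualBox driver,
# create a VM with 16 GB of RAM (value in MB)
docker-machine create -d virtualbox --virtualbox-memory 16384 default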
Another option could be to use conda (for example Anaconda or Miniconda) to create the correct environment. How to do this is described in this repository:
https://github.com/UKPLab/elmo-bilstm-cnn-crf
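A minimal sketch of that route (the Python version and the requirements.txt file name are assumptions; the linked repository documents the exact environment):
# create and activate a fresh environment, then install the dependencies
conda create -n bilstm python=3.6
source activate bilstm
pip install -r requirements.txt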
Hey,
So I followed your advice and have now increased the memory of the machine. I am also using devicemapper as the Docker storage driver in place of overlay, as the dm extensions were not working with the memory limit.
I currently have this configuration:
- CPU cores: 16
- Memory: 128 GB
- Swap memory: 4 GB (Docker does not support this)
The status of my container is as follows.
I ran the program, and it trained through the first epoch for over an hour and a half.
But then a segmentation fault occurred again and the core was dumped.
I suspect there is still some issue when an IO operation happens, as in the code I found the script that tries to save the model.
Can you guide me to any other possible causes of why my program fails after training?
Thanks in advance
Debayan
Hmm, I'm sadly clueless why this happens. 128 GB should be more than enough. It appears there might be some issue with saving the model, but I don't know what the cause could be.
You can maybe try to run it with conda (https://conda.io/docs/) and avoid using docker. Then we will see whether the issue lies in docker or in the python code.
Depending on the pre-trained word embeddings file, the stored models can become quite large. Maybe some component (python, docker) has an issue with storing very large files (for example, files larger than 4 GB).
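If you want to test that hypothesis independently of the training code, you could check the stored files and try writing a dummy file above the 4 GB mark from inside the container (the models/ path and the 5 GB size are assumptions; adjust to wherever your checkpoints are written):
# free space on the mounted volume and size of the saved models
df -h /src
ls -lh /src/models/
# write a ~5 GB dummy file, just above the 4 GB boundary, then clean up
dd if=/dev/zero of=/src/io_test.bin bs=1M count=5000
rm /src/io_test.bin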
For debugging, I recommend reducing the training file to only 1 or 2 sentences. An epoch will then be fast and you can find the issue quicker.
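Assuming the training file is in CoNLL format with sentences separated by blank lines (the file names below are placeholders), the shrinking could be done like this:
# paragraph mode (RS='') makes awk treat each blank-line-separated
# sentence as one record; keep the first two and restore the separator
awk -v RS='' 'NR<=2 {print $0 "\n"}' train.txt > train_small.txt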