pytorch training issue: insufficient shared memory
Eric-Zhang1990 opened this issue · 2 comments
@Tomcli @sboagibm I want to run the maskrcnn-benchmark project (https://github.com/facebookresearch/maskrcnn-benchmark),
but I found this NOTE at https://github.com/IBM/FfDL/blob/master/docs/user-guide.md#24-creating-manifest-file ("Note that all model definition files has to be in the first level of the zip file and there are no nested directories in the zip file."), which means I can't run a job that has multiple directories, right?
However, all our projects have multiple directories, so how can I run them on FfDL?
Thank you.
Hi @Eric-Zhang1990, technically you can package code with multiple directories using the bash `zip` command (e.g. `zip -r model.zip maskrcnn-benchmark`).
On the backend, FfDL will unzip model.zip into the learner container's work_dir. Then you can modify the command in manifest.yml to execute your workload (e.g. `python maskrcnn_benchmark/tools/train_net.py`).
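For reference, the same nested-directory archive can also be built from Python's standard library; this is a sketch equivalent to `zip -r model.zip maskrcnn-benchmark` (the project path is just the example name from above):

```python
import os
import zipfile

def zip_project(src_dir: str, out_zip: str) -> None:
    """Recursively archive src_dir, preserving its directory layout
    (equivalent to `zip -r out_zip src_dir`)."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                # Store paths relative to the parent of src_dir so the
                # top-level folder is kept inside the archive.
                arcname = os.path.relpath(path, os.path.dirname(src_dir))
                zf.write(path, arcname)

# Example: zip_project("maskrcnn-benchmark", "model.zip")
```

Either way, the nested directories survive inside the zip, which is what FfDL's unzip step needs.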
@Tomcli Thank you for your kind reply. I had missed the '-r' parameter when zipping the model files.
However, I've now hit another problem with shared memory ('shm'): "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).".
my config is:

```yaml
gpus: 1
cpus: 4
learners: 1
memory: 16Gb

framework:
  name: custom
  version: "pytorch:1.0-gpu"
  command: >
    export NGPUS=1; source /etc/profile; bash ${MODEL_DIR}/setup.sh;
    . env_file; python ${MODEL_DIR}/setup.py build develop;
    python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS ${MODEL_DIR}/train_net.py;
```
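This bus error typically appears when PyTorch DataLoader workers exchange tensors through a `/dev/shm` mount that is too small (Docker's default is only 64 MB). As a quick sanity check (a stdlib sketch, not an FfDL-specific API), you can print the size of the shm mount inside the learner container; if it is small and you cannot enlarge it, setting `num_workers=0` on the DataLoader is a common workaround:

```python
import os

def mount_size_mb(path: str = "/dev/shm") -> float:
    """Return the total size (in MB) of the filesystem backing `path`."""
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks / (1024 * 1024)

# Example inside the learner container (hypothetical threshold of 1 GB):
# if mount_size_mb() < 1024:
#     print("shm is small; consider num_workers=0 in your DataLoader")
```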
Thank you.