IBM/FfDL

PyTorch training issue: insufficient shared memory

Closed this issue · 2 comments

@Tomcli @sboagibm I want to run maskrcnn-benchmark project (https://github.com/facebookresearch/maskrcnn-benchmark),
but I found the NOTE at https://github.com/IBM/FfDL/blob/master/docs/user-guide.md#24-creating-manifest-file ("Note that all model definition files has to be in the first level of the zip file and there are no nested directories in the zip file."), which seems to mean I can't run a job whose code spans multiple directories, right?
However, all of our projects use multiple directories. How can I run them on FfDL?
Thank you.

Hi @Eric-Zhang1990, technically you can package a project with multiple directories using the bash zip command (e.g. zip -r model.zip maskrcnn-benchmark).

On the backend, FfDL unzips model.zip into the learner container's work_dir. You can then modify the command in manifest.yml to execute your workload
(e.g. python maskrcnn_benchmark/tools/train_net.py).
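For illustration, the `zip -r` packaging step can also be done from Python with the standard `zipfile` module. `package_model` below is a hypothetical helper (not part of FfDL) that preserves nested directories the same way the recursive zip command does:

```python
import os
import zipfile

def package_model(src_dir, zip_path):
    """Recursively zip src_dir, preserving nested directories
    (roughly equivalent to `zip -r model.zip src_dir`)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the parent of src_dir so the
                # top-level directory name survives inside the archive.
                arc = os.path.relpath(full, os.path.dirname(src_dir))
                zf.write(full, arc)
    return zip_path
```

Because the archive entries keep their directory prefixes, the learner can still find files such as `tools/train_net.py` after FfDL unzips the archive into work_dir.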

@Tomcli Thank you for your kind reply. I was missing the '-r' flag when zipping my model files.
However, I now hit another problem with shared memory: "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).".
My config is:

```yaml
gpus: 1
cpus: 4
learners: 1
memory: 16Gb

framework:
  name: custom
  version: "pytorch:1.0-gpu"
  command: >
    export NGPUS=1; source /etc/profile; bash ${MODEL_DIR}/setup.sh; . env_file;
    python ${MODEL_DIR}/setup.py build develop;
    python -m torch.distributed.launch --nproc_per_node=$NGPUS ${MODEL_DIR}/train_net.py;
```
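As general background (not stated in this thread): PyTorch DataLoader workers exchange tensors through /dev/shm, and a container's default shm allocation (64 MB under plain Docker) is easily exhausted; common workarounds are setting `num_workers=0` on the DataLoader, or mounting a memory-backed volume at /dev/shm when the pod spec is under your control. A small standard-library sketch (hypothetical helper) for checking the shm mount size inside the learner container:

```python
import os

def mount_size_bytes(path="/dev/shm"):
    """Total size, in bytes, of the filesystem backing `path`.

    A small value here (e.g. Docker's 64 MB default) would explain
    the "insufficient shared memory" bus errors from DataLoader
    workers.
    """
    st = os.statvfs(path)          # POSIX filesystem statistics
    return st.f_blocks * st.f_frsize
```

For example, `mount_size_bytes() / 2**20` reports the /dev/shm size in MB, which can be compared against the memory the DataLoader workers need.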

Thank you.