Custom MPI options don't override the flags
ChaiBapchya opened this issue · 3 comments
Describe the bug
The custom_mpi_options flag in the SageMaker training toolkit doesn't override the default flags of the MPI command; it only appends the custom flags after them.
Logic
sagemaker-training-toolkit/src/sagemaker_training/mpi.py
Lines 185 to 188 in c357433
To reproduce
from sagemaker.mxnet import MXNet

# role, image and sagemaker_session are assumed to be defined beforehand.
mpi_options = '-verbose -x orte_base_help_aggregate=0 -map-by socket -rank-by core'
estimator = MXNet(
    entry_point='hvd_resnet_mx.py',
    role=role,
    train_instance_type='ml.p3.8xlarge',
    train_instance_count=2,
    image_name=image,
    framework_version='1.6.0',
    py_version='py3',
    hyperparameters={
        'sagemaker_mpi_enabled': True,
        'sagemaker_mpi_custom_mpi_options': mpi_options,
        'sagemaker_mpi_num_of_processes_per_host': 4,
    },
    sagemaker_session=sagemaker_session,
)
Running this doesn't override the default MPI flags. The generated command is:
mpirun --host algo-1:4,algo-2:4 -np 8 --allow-run-as-root --display-map
--tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to socket
-map-by slot -mca pml ob1 -mca btl ^openib
-mca orte_abort_on_non_zero_status 1 -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/usr/local/lib/python3.6/site-packages/gethostname.cpython-36m-x86_64-linux-gnu.so -verbose -x orte_base_help_aggregate=0
-map-by socket -rank-by core -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x PYTHONPATH /usr/local/bin/python3.6 -m mpi4py hvd_resnet_mx.py
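One way to spot the problem in a long mpirun line is to scan it for flags that appear more than once. The following is only a rough sketch: `repeated_flags` and the `REPEATABLE` set are hypothetical names, the command string is a shortened version of the one above, and the flag parsing is a simplification (it assumes a flag takes at most one value and that values never start with a dash).

```python
import shlex
from collections import defaultdict

# Flags that are normally passed many times (assumption for this sketch).
REPEATABLE = {"x", "mca"}


def repeated_flags(command):
    """Return {flag: [values...]} for non-repeatable flags given more than once."""
    seen = defaultdict(list)
    tokens = shlex.split(command)
    for i, tok in enumerate(tokens):
        if not tok.startswith("-"):
            continue  # a value or the executable name, not a flag
        name = tok.lstrip("-")
        if name in REPEATABLE:
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # Record the value that follows the flag, or "" for a bare switch.
        seen[name].append("" if nxt.startswith("-") else nxt)
    return {name: vals for name, vals in seen.items() if len(vals) > 1}


# Shortened version of the generated command from this issue.
cmd = (
    "mpirun --host algo-1:4,algo-2:4 -np 8 -bind-to socket -map-by slot "
    "-mca pml ob1 -x NCCL_MIN_NRINGS=4 -verbose -map-by socket -rank-by core"
)
print(repeated_flags(cmd))  # {'map-by': ['slot', 'socket']}
```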
Expected behavior
Expected -map-by to appear only once, with the custom value replacing the default. Instead, the command contains both:
-map-by slot
-map-by socket
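For illustration, the expected "override instead of append" behavior could look roughly like the sketch below. This is not the toolkit's code and not a proposed patch: `split_options`, `option_key` and `merge_mpi_options` are hypothetical helpers, and the parsing assumes that option values never start with a dash and that `-x` and `-mca` are keyed by the variable or parameter name that follows them.

```python
import shlex


def split_options(tokens):
    """Group a flat token list into [flag, value, ...] groups."""
    groups = []
    for tok in tokens:
        if tok.startswith("-") or not groups:
            groups.append([tok])
        else:
            groups[-1].append(tok)
    return groups


def option_key(group):
    """Identify what a group sets, so a later group can replace an earlier one."""
    flag = group[0].lstrip("-")
    if flag in ("x", "mca") and len(group) > 1:
        # "-x NAME=value" and "-mca name value" are keyed by NAME / name.
        return (flag, group[1].split("=", 1)[0])
    return (flag,)


def merge_mpi_options(defaults, custom):
    """Merge custom options into the defaults; custom values win on conflict."""
    merged = {}
    for group in split_options(defaults) + split_options(shlex.split(custom)):
        # Re-assigning an existing key keeps its original position in the dict.
        merged[option_key(group)] = group
    return [tok for group in merged.values() for tok in group]


defaults = ["-bind-to", "socket", "-map-by", "slot", "-x", "NCCL_MIN_NRINGS=4"]
custom = "-verbose -map-by socket -rank-by core -x NCCL_MIN_NRINGS=1"
print(" ".join(merge_mpi_options(defaults, custom)))
# -bind-to socket -map-by socket -x NCCL_MIN_NRINGS=1 -verbose -rank-by core
```

With this kind of merge, -map-by appears once and carries the custom value, while flags that only exist in the custom string are appended at the end.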
System information
Latest MX Dockerfile
Thank you for reporting that!
Moreover, since this is training-toolkit specific, it would be an issue regardless of the framework.
You can override an env var with -x key=value, but I'm not sure whether you can override MPI parameters like -rank-by core. In detail:
I tested this: if you set an env var (-x) with custom_mpi_options, it overrides the default value of that env var, because the custom options are concatenated at the end of the mpirun command.
Estimator distribution dict:
distribution = { "mpi": {"enabled": True, "custom_mpi_options": "-x NCCL_MIN_NRINGS=1", } }
Runtime output when printing env var from within the MPI worker:
rv81uo39sp-algo-1-wr5k6 | [1,mpirank:0,algo-1]<stdout>:NCCL_MIN_NRINGS=1
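A check like this can be reproduced with a few lines in the training entry point. The snippet below is a minimal sketch, not the code that was actually used for the test above; `describe_env` is a hypothetical helper, and the log prefix seen in the runtime output is not produced by it.

```python
import os


def describe_env(name, environ=None):
    """Format NAME=value for an environment variable, or NAME=<unset>."""
    environ = os.environ if environ is None else environ
    return "{}={}".format(name, environ.get(name, "<unset>"))


# Print the value each MPI worker actually sees at runtime.
print(describe_env("NCCL_MIN_NRINGS"))
```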