Output directory environment variables are not being set
Keegil opened this issue · 2 comments
I've modified the Horovod recipe to train a Keras model, and I'm having trouble saving results and models to the file share because the AZ_BATCHAI_INPUT_x and AZ_BATCHAI_OUTPUT_x variables are either not being set or aren't accessible from the Python process.
According to the documentation of azure.mgmt.batchai.models.JobCreateParameters:
"Batch AI service sets the following environment variables for all jobs: AZ_BATCHAI_INPUT_id, AZ_BATCHAI_OUTPUT_id, AZ_BATCHAI_NUM_GPUS_PER_NODE."
And according to the documentation of azure.mgmt.batchai.models.OutputDirectory:
"The name for the output directory. It will be available for the job as an environment variable under AZ_BATCHAI_OUTPUT_id."
However:
- No environment variables show up under "Environment variables" on the job details page in the Azure portal
- If I try to access the environment variables using os.environ['AZ_BATCHAI_OUTPUT_x'], I get a KeyError (a quick check is sketched below)
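For what it's worth, a minimal diagnostic along these lines (standard library only) shows which Batch AI variables are actually visible to the process; in my case the AZ_BATCHAI_OUTPUT_* entries are missing:

import os

# Print every AZ_BATCHAI_* variable this process can see. In the failing
# runs, the AZ_BATCHAI_OUTPUT_* entries do not appear in this listing.
batchai_vars = {k: v for k, v in os.environ.items()
                if k.startswith('AZ_BATCHAI_')}
for name, value in sorted(batchai_vars.items()):
    print(name, '=', value)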
This is the code for configuring the job:
parameters = models.job_create_parameters.JobCreateParameters(
    location=cfg.location,
    cluster=models.ResourceId(cluster.id),
    node_count=4,
    input_directories=[
        models.InputDirectory(id='SCRIPTS', path='$AZ_BATCHAI_MOUNT_ROOT/{0}/scripts'.format(azure_file_share))
    ],
    output_directories=[
        models.OutputDirectory(id='MODELS', path_prefix='$AZ_BATCHAI_MOUNT_ROOT/{0}'.format(azure_file_share), path_suffix='models'),
        models.OutputDirectory(id='RESULTS', path_prefix='$AZ_BATCHAI_MOUNT_ROOT/{0}'.format(azure_file_share), path_suffix='results')
    ],
    std_out_err_path_prefix="$AZ_BATCHAI_MOUNT_ROOT/{0}".format(azure_file_share),
    container_settings=models.ContainerSettings(
        models.ImageSourceRegistry(image='tensorflow/tensorflow:1.4.0-gpu-py3')),
    job_preparation=models.JobPreparation(
        command_line="apt update; apt install mpi-default-dev mpi-default-bin -y; pip install azure; pip install horovod; pip install keras; pip install h5py"),
    custom_toolkit_settings=models.CustomToolkitSettings(
        command_line='mpirun -mca btl_tcp_if_exclude docker0,lo --allow-run-as-root --hostfile $AZ_BATCHAI_MPI_HOST_FILE python $AZ_BATCHAI_INPUT_SCRIPTS/dev-DeepAttach-trainrater-dist-py3.py'))
Please take a look at the command line that is used for launching the workers:
"mpirun -mca btl_tcp_if_exclude docker0,lo --allow-run-as-root --hostfile $AZ_BATCHAI_MPI_HOST_FILE python $AZ_BATCHAI_INPUT_SCRIPTS/tensorflow_mnist.py"
Workers launched this way do not automatically inherit the job's environment, so if you need an environment variable to be available to each worker, you have to provide its value explicitly in the mpirun call (for example, via Open MPI's -x flag). However, I would suggest passing the output directory via a command-line argument instead (e.g. python $AZ_BATCHAI_INPUT_SCRIPTS/tensorflow_mnist.py --output=$AZ_BATCHAI_OUTPUT_MODELS).
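For reference, here is a minimal sketch of the script-side change, assuming the job's command line is amended to end with --output=$AZ_BATCHAI_OUTPUT_MODELS; the --output flag name and the final save call are illustrative, not part of the recipe:

import argparse
import os

# '--output' is an illustrative flag name. The shell on the master node
# expands $AZ_BATCHAI_OUTPUT_MODELS in the job's command line before mpirun
# starts, so every worker receives the resolved path as a plain argument.
parser = argparse.ArgumentParser()
parser.add_argument('--output', required=True,
                    help='directory to write trained models into')
args = parser.parse_args()

os.makedirs(args.output, exist_ok=True)

# ... build and train the Keras model with Horovod as before ...
# Afterwards, save from a single worker (rank 0) to avoid concurrent writes:
# if hvd.rank() == 0:
#     model.save(os.path.join(args.output, 'final_model.h5'))

This sidesteps the environment-propagation question entirely, since the path travels through argv rather than through the workers' environment.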
Thanks a lot; the latter solution you suggested works very well!