microsoft/InnerEye-DeepLearning

`NODE_RANK` KeyError on training runs

peterhessey opened this issue · 0 comments

Is there an existing issue for this?

  • I have searched the existing issues

Bug summary

Training models in InnerEye currently fails with the error below, caused by recent changes in AML (Azure Machine Learning). The problem has been fixed in the latest hi-ml release (v0.2.5), so InnerEye-DeepLearning (IE-DL) needs to be updated to use that version.

Code for reproduction

python InnerEye/ML/runner.py --model=Lung --azureml

Actual outcome

The training run fails with a `KeyError` for `NODE_RANK`.

Error messages

  File "InnerEye/ML/runner.py", line 466, in <module>
    main()
  File "InnerEye/ML/runner.py", line 460, in main
    run(project_root=fixed_paths.repository_root_directory(),
  File "InnerEye/ML/runner.py", line 456, in run
    return runner.run()
  File "InnerEye/ML/runner.py", line 220, in run
    self.run_in_situ(azure_run_info)
  File "InnerEye/ML/runner.py", line 408, in run_in_situ
    set_environment_variables_for_multi_node()
  File "/mnt/azureml/cr/j/bc3f99f19bb745519fd9272cfd730249/exe/wd/InnerEye/Azure/azure_runner.py", line 313, in set_environment_variables_for_multi_node
    env_vars = ", ".join(f"{var} = {os.environ[var]}" for var in [ENV_MASTER_ADDR, ENV_MASTER_PORT, ENV_NODE_RANK])
  File "/mnt/azureml/cr/j/bc3f99f19bb745519fd9272cfd730249/exe/wd/InnerEye/Azure/azure_runner.py", line 313, in <genexpr>
    env_vars = ", ".join(f"{var} = {os.environ[var]}" for var in [ENV_MASTER_ADDR, ENV_MASTER_PORT, ENV_NODE_RANK])
  File "/azureml-envs/azureml_e12c14b51edf42f47eec39c741162949/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'NODE_RANK'
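
To illustrate why the logging line at `azure_runner.py:313` raises, here is a minimal sketch of a defensive variant that reports missing variables instead of indexing `os.environ` directly. It is not the fix shipped in hi-ml v0.2.5; the helper name `describe_env_vars` is hypothetical, and the string values assigned to `ENV_MASTER_ADDR` and `ENV_MASTER_PORT` are assumptions (only `NODE_RANK` is confirmed by the error message).

```python
import os
from typing import Iterable, Mapping, Optional

# Assumed values for the constants named in the traceback; only "NODE_RANK"
# is confirmed by the KeyError message above.
ENV_MASTER_ADDR = "MASTER_ADDR"
ENV_MASTER_PORT = "MASTER_PORT"
ENV_NODE_RANK = "NODE_RANK"


def describe_env_vars(names: Iterable[str],
                      environ: Optional[Mapping[str, str]] = None) -> str:
    """Format each variable as 'NAME = value', using '<not set>' when it is missing.

    Hypothetical helper: `environ` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if environ is None else environ
    # .get() avoids the KeyError that os.environ[name] raises for unset variables.
    return ", ".join(f"{name} = {env.get(name, '<not set>')}" for name in names)


if __name__ == "__main__":
    print(describe_env_vars([ENV_MASTER_ADDR, ENV_MASTER_PORT, ENV_NODE_RANK]))
```

A variant like this would only keep the log statement from crashing the run; whether multi-node training then works still depends on AML or hi-ml providing the expected variables.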

Expected outcome

The training run completes successfully.

System info

No response

AB#7305