microsoft/AutonomousDrivingCookbook

Distributed RL batch training on Azure error

wonjoonSeol opened this issue · 1 comments

It is very difficult to reproduce the result shown in the paper by following the steps in the tutorial.

Issue 1: Unable to run jobs on Azure using LaunchTrainingJob.ipynb

[Screenshot: azure error2]

Running the LaunchTrainingJob notebook results in:
TaskSchedulingConstraintFailed Reason: The user used to run the task is not found

We specified batch_job_user_name in the JSON, but that produces this error.
Changing the task's user identity to Task user (Admin) makes the problem go away.
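
For reference, the "Task user (Admin)" option appears to correspond to an auto-user identity with admin elevation in the Batch REST task schema. A minimal sketch of that fragment as a plain dictionary follows. The helper name `with_admin_auto_user` and the sample task are illustrative only, and the field names (`userIdentity`, `autoUser`, `scope`, `elevationLevel`) should be checked against the current Batch API reference.

```python
import json

def with_admin_auto_user(task, scope="task"):
    """Return a copy of a task definition that runs as an admin auto-user.

    Assumption: field names follow the Batch REST task schema
    (userIdentity -> autoUser -> scope / elevationLevel).
    """
    updated = dict(task)
    # Overwrite whatever identity was set before; an auto-user does not
    # depend on a named user account existing on the pool nodes.
    updated["userIdentity"] = {
        "autoUser": {"scope": scope, "elevationLevel": "admin"}
    }
    return updated

# Hypothetical task definition, for illustration only.
task = {"id": "example-task", "commandLine": "cmd /c echo hello"}

print(json.dumps(with_admin_auto_user(task), indent=2))
```

The original dictionary is left untouched, so the two variants can be compared side by side.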

[Screenshot: azure error]

After fixing that, I get a different error: "The specified command program is not found".

CommandLine: call C:\\prereq\\mount.bat && C:\\ProgramData\\Anaconda3\\Scripts\\activate.bat py36 && python -u Z:\\scripts_downpour\\app\\distributed_agent.py data_dir=Z: role=agent max_epoch_runtime_sec=30 per_iter_epsilon_reduction=0.003000 min_epsilon=0.100000 batch_size=32 replay_memory_size=2000 experiment_name=distributed_rl_75726dee-3f90-41e4-8657-3f7ae8dc924d weights_path=Z:\data\pretrain_model_weights.h5 train_conv_layers=false
Message: The system cannot find the file specified.
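
One possible cause, not confirmed in this thread: Azure Batch does not run a task's command line under a shell, so built-ins such as `call` and operators such as `&&` are only understood when the command line itself invokes `cmd /c`. A minimal sketch of wrapping the command follows. `wrap_for_cmd` is an illustrative helper, not part of the tutorial code, and the command is shortened.

```python
def wrap_for_cmd(command):
    """Wrap a Windows command so that cmd.exe interprets `call` and `&&`."""
    if command.lower().startswith(("cmd /c", "cmd.exe /c")):
        return command  # already runs under a shell
    return 'cmd /c "{0}"'.format(command)

# Shortened version of the command line from the error above.
raw = (
    r"call C:\prereq\mount.bat && "
    r"C:\ProgramData\Anaconda3\Scripts\activate.bat py36 && "
    r"python -u Z:\scripts_downpour\app\distributed_agent.py data_dir=Z: role=agent"
)

print(wrap_for_cmd(raw))
```

Calling the helper twice returns the same string, so it is safe to apply to a command that is already wrapped.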

Note that weights_path=Z:\data\pretrain_model_weights.h5 (generated by the code) lacks the extra escape character '\' that the other paths have. I tried adding it, but the error stays the same.
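
To separate a real path problem from a display artifact, it can help to compare the string Python actually holds with its JSON-escaped rendering. A small sketch using only the standard library follows. Whether the doubled backslashes in the CommandLine above come from JSON-style escaping is an assumption.

```python
import json
import ntpath

# Build the Windows path explicitly instead of hand-writing backslashes.
weights_path = ntpath.join("Z:\\", "data", "pretrain_model_weights.h5")

print(weights_path)              # the string as a process would receive it
print(json.dumps(weights_path))  # the same string with JSON escaping applied

# A raw string and an escaped literal describe the same path.
assert weights_path == r"Z:\data\pretrain_model_weights.h5"
assert weights_path == "Z:\\data\\pretrain_model_weights.h5"
```

If both renderings describe the same underlying path, the missing '\' is unlikely to be what makes the command fail.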

I honestly don't think anyone who starred this repo has actually run the code themselves.
Issue 1 is the most critical one because it prevents me from running the training job at all.

Issue 2: SetupCluster.ipynb

This one is merely for bug reporting.

with open('Template\\pool.json.template', 'r') as f:
    pool_config = f.read()
    
pool_config = pool_config\
                .replace('{batch_pool_name}', NOTEBOOK_CONFIG['batch_pool_name'])\
                .replace('{subscription_id}', NOTEBOOK_CONFIG['subscription_id'])\
                .replace('{resource_group_name}', NOTEBOOK_CONFIG['resource_group_name'])\
                .replace('{storage_account_name}', NOTEBOOK_CONFIG['storage_account_name'])\
                .replace('{batch_job_user_name}', NOTEBOOK_CONFIG['batch_job_user_name'])\
                .replace('{batch_job_user_password}', NOTEBOOK_CONFIG['batch_job_user_password'])\
                .replace('{batch_pool_size}', str(NOTEBOOK_CONFIG['batch_pool_size']))

with open('pool.json', 'w') as f:
    f.write(pool_config)
    
create_cmd = 'powershell.exe ".\ProvisionCluster.ps1 -subscriptionId {0} -resourceGroupName {1} -batchAccountName {2}"'\
    .format(NOTEBOOK_CONFIG['subscription_id'], NOTEBOOK_CONFIG['resource_group_name'], NOTEBOOK_CONFIG['batch_account_name'])
    
print('Executing command. Check the terminal output for authentication instructions.')

os.system(create_cmd)

This code no longer works because the JSON file it creates no longer contains enough information to create a pool on the current Azure cloud.
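
One way to narrow this down is to check whether any `{placeholder}` from the template survived the substitution before the file is handed to ProvisionCluster.ps1. A minimal sketch follows. The template text and the helper `unfilled_placeholders` are made up for illustration and do not reproduce the real pool.json.template.

```python
import re

def unfilled_placeholders(text):
    """Return the sorted names of {placeholder} tokens still present in text."""
    # Matches tokens such as {batch_pool_name}. JSON object braces are
    # followed by quotes or whitespace, so they are not matched.
    return sorted(set(re.findall(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", text)))

# Illustrative template fragment (not the real file).
template = (
    '{"name": "{batch_pool_name}", "size": {batch_pool_size}, '
    '"user": "{batch_job_user_name}"}'
)
config = {"batch_pool_name": "rl-pool", "batch_pool_size": 4}

filled = template
for key, value in config.items():
    filled = filled.replace("{" + key + "}", str(value))

print(unfilled_placeholders(filled))
```

An empty list only shows that the substitution was complete. It does not show that the resulting JSON has every field the current service expects.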

I created a pool manually using Batch Explorer and noticed that the pool has to be created without any Start Task; the Start Task must then be set separately once the pool exists. Otherwise, you end up with this error:

InvalidPropertyValue
The value provided for one of the properties in the request body is invalid.

PropertyName: dataDisks
Reason: Only one of dataDisks and virtualMachineImageId can be specified
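
A pre-flight check for this conflict can be sketched as a recursive search for the two property names from the error message. The pool definition below is hypothetical and its nesting is illustrative only; only the names `dataDisks` and `virtualMachineImageId` come from the error.

```python
def find_keys(node, wanted):
    """Recursively collect which of the wanted keys occur in a nested config."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key in wanted:
                found.add(key)
            found |= find_keys(value, wanted)
    elif isinstance(node, list):
        for item in node:
            found |= find_keys(item, wanted)
    return found

CONFLICTING = {"dataDisks", "virtualMachineImageId"}

# Hypothetical pool definition; the nesting is illustrative only.
pool = {
    "id": "rl-pool",
    "virtualMachineConfiguration": {
        "imageReference": {"virtualMachineImageId": "<custom image id>"},
        "dataDisks": [{"lun": 0, "diskSizeGB": 128}],
    },
}

clash = find_keys(pool, CONFLICTING)
if clash == CONFLICTING:
    print("pool definition sets both:", sorted(clash))
```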

LaunchTrainingJob.ipynb

Outdated keyword argument in the code:
batch_client = batch.BatchServiceClient(batch_credentials, base_url=NOTEBOOK_CONFIG['batch_account_url'])

Should be:

batch_client = batch.BatchServiceClient(credentials=batch_credentials, batch_url=NOTEBOOK_CONFIG['batch_account_url'])
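
If the notebook has to work with both older and newer SDK versions, the keyword can be chosen at run time by inspecting the constructor's signature. A minimal sketch follows. `make_batch_client` is a hypothetical helper, and the two classes are stand-ins that mimic the signatures discussed above rather than the real SDK.

```python
import inspect

def make_batch_client(client_cls, credentials, url):
    """Construct client_cls using whichever URL keyword its signature accepts."""
    params = inspect.signature(client_cls).parameters
    for name in ("batch_url", "base_url"):
        if name in params:
            return client_cls(credentials=credentials, **{name: url})
    raise TypeError("no known URL keyword on {0}".format(client_cls.__name__))

# Stand-in classes that mimic the two signatures (not the real SDK).
class OldClient:
    def __init__(self, credentials, base_url):
        self.url = base_url

class NewClient:
    def __init__(self, credentials, batch_url):
        self.url = batch_url

for cls in (OldClient, NewClient):
    client = make_batch_client(cls, credentials=None, url="https://example.invalid")
    print(cls.__name__, client.url)
```

With the real SDK the call would presumably be `make_batch_client(batch.BatchServiceClient, batch_credentials, NOTEBOOK_CONFIG['batch_account_url'])`, which is untested here.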

Similarly,

job = batch.models.JobAddParameter(
        job_id,
        batch.models.PoolInformation(pool_id=NOTEBOOK_CONFIG['batch_pool_name']))

batch_client.job.add(job)

Should be:

job = batch.models.JobAddParameter(
        id=job_id,
        pool_info=batch.models.PoolInformation(pool_id=NOTEBOOK_CONFIG['batch_pool_name']))

Miscellaneous

  • Be careful when choosing the Azure region. Few regions offer NV6, and trying to create a pool in a region without it causes an error. (I am currently using US East.)
  • Make sure to upgrade your free-trial subscription to pay-as-you-go and request a higher Batch quota via a support ticket. Free-trial subscriptions don't offer NV6.

Thanks for the report. This worked a year ago when we initially wrote the tutorial; it looks like the API has changed a bit from under us. We'll look at updating it.