Distributed RL batch training on Azure error
wonjoonSeol opened this issue · 1 comment
It is very difficult to reproduce the results shown in the paper by following the steps in the tutorial.
Issue 1: unable to run jobs on Azure using LaunchTrainJob.ipynb
Running the LaunchTrainJob notebook results in:
TaskSchedulingConstraintFailed Reason: The user used to run the task is not found
We specified batch_job_user_name in the JSON, but that produces this error. The problem only goes away after I change the task's user identity to Task user (Admin).
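For anyone hitting the same error, this is a sketch (not an official fix) of what "Task user (Admin)" corresponds to in the job/task JSON: an elevated auto-user instead of a named account. The surrounding dict is illustrative; only the userIdentity block matters, and its field names follow the Azure Batch REST schema.

```python
import json

# Sketch: task-level user identity as it would appear in the job JSON
# template. Swapping a named batch_job_user_name reference for this
# elevated auto-user is what made TaskSchedulingConstraintFailed go
# away for me; field names follow the Azure Batch REST schema.
task_user_identity = {
    "userIdentity": {
        "autoUser": {
            "scope": "pool",           # run as the pool-wide auto-user
            "elevationLevel": "admin"  # "Task user (Admin)" in the UI
        }
    }
}

print(json.dumps(task_user_identity, indent=2))
```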
After fixing that issue, I end up with 'The specified command program is not found':
CommandLine: call C:\\prereq\\mount.bat && C:\\ProgramData\\Anaconda3\\Scripts\\activate.bat py36 && python -u Z:\\scripts_downpour\\app\\distributed_agent.py data_dir=Z: role=agent max_epoch_runtime_sec=30 per_iter_epsilon_reduction=0.003000 min_epsilon=0.100000 batch_size=32 replay_memory_size=2000 experiment_name=distributed_rl_75726dee-3f90-41e4-8657-3f7ae8dc924d weights_path=Z:\data\pretrain_model_weights.h5 train_conv_layers=false
Message: The system cannot find the file specified.
Notice that weights_path=Z:\data\pretrain_model_weights.h5 (generated by the code) does not have the extra escape character '\'; I tried adding that too, but I still get the same error.
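If it helps rule things out: a quick stdlib-only check shows that the escaped and unescaped spellings of this path are the same runtime string in Python, so the single backslashes in the rendered command line are cosmetic. That may suggest the real cause is the Z: mount or a missing weights file rather than escaping (an assumption on my part, not something I have confirmed).

```python
# Sanity check: different source-level escapings of the same Windows
# path produce the identical runtime string, so the lack of doubled
# backslashes in the rendered command line is not itself the bug.
escaped = 'Z:\\data\\pretrain_model_weights.h5'  # explicit escapes
raw = r'Z:\data\pretrain_model_weights.h5'       # raw string literal

print(escaped == raw)  # True: both are the same single-backslash path
```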
I honestly don't think anyone who starred this repo has actually run the code themselves.
Issue 1 is the most critical part, because it blocks running the training job entirely.
Issue 2: SetupCluster.ipynb
This one is merely for bug reporting.
import os

with open('Template\\pool.json.template', 'r') as f:
    pool_config = f.read()

pool_config = pool_config\
    .replace('{batch_pool_name}', NOTEBOOK_CONFIG['batch_pool_name'])\
    .replace('{subscription_id}', NOTEBOOK_CONFIG['subscription_id'])\
    .replace('{resource_group_name}', NOTEBOOK_CONFIG['resource_group_name'])\
    .replace('{storage_account_name}', NOTEBOOK_CONFIG['storage_account_name'])\
    .replace('{batch_job_user_name}', NOTEBOOK_CONFIG['batch_job_user_name'])\
    .replace('{batch_job_user_password}', NOTEBOOK_CONFIG['batch_job_user_password'])\
    .replace('{batch_pool_size}', str(NOTEBOOK_CONFIG['batch_pool_size']))

with open('pool.json', 'w') as f:
    f.write(pool_config)

create_cmd = 'powershell.exe ".\\ProvisionCluster.ps1 -subscriptionId {0} -resourceGroupName {1} -batchAccountName {2}"'\
    .format(NOTEBOOK_CONFIG['subscription_id'], NOTEBOOK_CONFIG['resource_group_name'], NOTEBOOK_CONFIG['batch_account_name'])

print('Executing command. Check the terminal output for authentication instructions.')
os.system(create_cmd)
This code no longer works: the JSON file it creates no longer contains enough information to create a pool on current Azure.
I created the pool manually using Batch Explorer. I noticed that the pool must be created without any 'Start Task', and the Start Task set separately after the pool exists. Otherwise, you end up with this error:
InvalidPropertyValue
The value provided for one of the properties in the request body is invalid.
PropertyName: dataDisks
Reason: Only one of dataDisks and virtualMachineImageId can be specified
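To document the workaround concretely, here is a sketch of the two-step flow expressed as Azure Batch REST request bodies. I have not tested this end-to-end in this exact form; field names follow the REST schema, but every concrete value (pool id, image id, node agent SKU, start-task command) is a placeholder, not something taken from the repo.

```python
# Step 1: create the pool WITHOUT a startTask. Including one alongside a
# custom image is what triggered the InvalidPropertyValue/dataDisks error
# for me. All concrete values below are placeholders.
create_pool_body = {
    "id": "my-nv6-pool",
    "vmSize": "Standard_NV6",
    "virtualMachineConfiguration": {
        "imageReference": {"virtualMachineImageId": "<custom-image-resource-id>"},
        "nodeAgentSKUId": "batch.node.windows amd64",
    },
    "targetDedicatedNodes": 4,
}

# Step 2: once the pool exists, set the start task via a PATCH on the pool
# (PATCH .../pools/my-nv6-pool). The command line here is hypothetical.
patch_pool_body = {
    "startTask": {
        "commandLine": "cmd /c C:\\prereq\\setup.bat",
        "userIdentity": {"autoUser": {"elevationLevel": "admin"}},
        "waitForSuccess": True,
    }
}
```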
LaunchTrainingJob.ipynb
The following call fails with the current SDK (the argument names no longer match):
batch_client = batch.BatchServiceClient(batch_credentials, base_url=NOTEBOOK_CONFIG['batch_account_url'])
Should be:
batch_client = batch.BatchServiceClient(credentials=batch_credentials, batch_url=NOTEBOOK_CONFIG['batch_account_url'])
Similarly,
job = batch.models.JobAddParameter(
job_id,
batch.models.PoolInformation(pool_id=NOTEBOOK_CONFIG['batch_pool_name']))
batch_client.job.add(job)
Should be:
job = batch.models.JobAddParameter(
    id=job_id,
    pool_info=batch.models.PoolInformation(pool_id=NOTEBOOK_CONFIG['batch_pool_name']))
Miscellaneous
- Be careful when choosing the Azure region: not many regions offer NV6, and trying to create a pool in a region without it causes an error. (I am currently using East US.)
- Make sure to upgrade your free trial to pay-as-you-go and request a higher Batch quota via a support ticket. The free-trial subscription doesn't offer NV6.
Thanks for the report. This worked a year ago when we initially wrote the tutorial; it looks like the API has changed out from under us. We'll look at updating it.