Recipes don't include information about accessing results
mhauskn opened this issue · 4 comments
It seems the recipes end with submitting the job. However, having successfully followed a recipe to completion, it would be nice to know how to access the outputs from the job.
After running the TensorFlow example, I tried to access the results and got the following error:
matthew@cantor:~/BatchAI/recipes/TensorFlow/TensorFlow-GPU$ az batchai job list-files --name tensorflow -d tensorflow_samples
Error occurred in request., RetryError: HTTPSConnectionPool(host='management.azure.com', port=443): Max retries exceeded with url: /subscriptions/6ad709f4-8451-47eb-b4aa-24733abf60e4/resourceGroups/batchaitests/providers/Microsoft.BatchAI/jobs/tensorflow/listOutputFiles?api-version=2017-09-01-preview&outputdirectoryid=tensorflow_samples&linkexpiryinminutes=60&maxresults=1000 (Caused by ResponseError('too many 500 error responses',))
Traceback (most recent call last):
File "/opt/az/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
timeout=timeout
File "/opt/az/lib/python3.6/site-packages/urllib3/connectionpool.py", line 732, in urlopen
body_pos=body_pos, **response_kw)
File "/opt/az/lib/python3.6/site-packages/urllib3/connectionpool.py", line 732, in urlopen
body_pos=body_pos, **response_kw)
File "/opt/az/lib/python3.6/site-packages/urllib3/connectionpool.py", line 732, in urlopen
body_pos=body_pos, **response_kw)
File "/opt/az/lib/python3.6/site-packages/urllib3/connectionpool.py", line 712, in urlopen
retries = retries.increment(method, url, response=response, _pool=self)
File "/opt/az/lib/python3.6/site-packages/urllib3/util/retry.py", line 388, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='management.azure.com', port=443): Max retries exceeded with url: /subscriptions/6ad709f4-8451-47eb-b4aa-24733abf60e4/resourceGroups/batchaitests/providers/Microsoft.BatchAI/jobs/tensorflow/listOutputFiles?api-version=2017-09-01-preview&outputdirectoryid=tensorflow_samples&linkexpiryinminutes=60&maxresults=1000 (Caused by ResponseError('too many 500 error responses',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/az/lib/python3.6/site-packages/msrest/service_client.py", line 194, in send
**kwargs)
File "/opt/az/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/opt/az/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
r = adapter.send(request, **kwargs)
File "/opt/az/lib/python3.6/site-packages/requests/adapters.py", line 499, in send
raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='management.azure.com', port=443): Max retries exceeded with url: /subscriptions/6ad709f4-8451-47eb-b4aa-24733abf60e4/resourceGroups/batchaitests/providers/Microsoft.BatchAI/jobs/tensorflow/listOutputFiles?api-version=2017-09-01-preview&outputdirectoryid=tensorflow_samples&linkexpiryinminutes=60&maxresults=1000 (Caused by ResponseError('too many 500 error responses',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/az/lib/python3.6/site-packages/azure/cli/main.py", line 36, in main
cmd_result = APPLICATION.execute(args)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/application.py", line 212, in execute
result = expanded_arg.func(params)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 377, in __call__
return self.handler(*args, **kwargs)
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 630, in _execute_command
raise client_exception
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 620, in _execute_command
reraise(*sys.exc_info())
File "/opt/az/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 602, in _execute_command
result = op(client, **kwargs) if client else op(**kwargs)
File "/opt/az/lib/python3.6/site-packages/azure/cli/command_modules/batchai/custom.py", line 332, in list_files
return list(client.list_output_files(resource_group, job_name, options))
File "/opt/az/lib/python3.6/site-packages/msrest/paging.py", line 109, in __next__
self.advance_page()
File "/opt/az/lib/python3.6/site-packages/msrest/paging.py", line 95, in advance_page
self._response = self._get_next(self.next_link)
File "/opt/az/lib/python3.6/site-packages/azure/mgmt/batchai/operations/jobs_operations.py", line 698, in internal_paging
request, header_parameters, **operation_config)
File "/opt/az/lib/python3.6/site-packages/msrest/service_client.py", line 220, in send
raise_with_traceback(ClientRequestError, msg, err)
File "/opt/az/lib/python3.6/site-packages/msrest/exceptions.py", line 45, in raise_with_traceback
raise error.with_traceback(exc_traceback)
File "/opt/az/lib/python3.6/site-packages/msrest/service_client.py", line 194, in send
**kwargs)
File "/opt/az/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/opt/az/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
r = adapter.send(request, **kwargs)
File "/opt/az/lib/python3.6/site-packages/requests/adapters.py", line 499, in send
raise RetryError(e, request=request)
msrest.exceptions.ClientRequestError: Error occurred in request., RetryError: HTTPSConnectionPool(host='management.azure.com', port=443): Max retries exceeded with url: /subscriptions/6ad709f4-8451-47eb-b4aa-24733abf60e4/resourceGroups/batchaitests/providers/Microsoft.BatchAI/jobs/tensorflow/listOutputFiles?api-version=2017-09-01-preview&outputdirectoryid=tensorflow_samples&linkexpiryinminutes=60&maxresults=1000 (Caused by ResponseError('too many 500 error responses',))
Thank you for reporting the issue. We will investigate and resolve it shortly.
We will fix the error reporting. The issue is that you specified the wrong directory id in the -d parameter. The directory id is either "stdouterr" for the standard stdout and stderr streams, or the value given by "id" in the job's "outputDirectories" definition.
For example:
"outputDirectories": [{
    "id": "MODEL",
    "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
    "pathSuffix": "Models"
}],
you need to provide "-d MODEL".
Thanks,
Alex
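As a rough illustration of the rule described above, the sketch below collects the ids that could be passed to -d for a given job definition. The helper name valid_directory_ids is hypothetical, and the sketch assumes the "outputDirectories" list sits either at the top level of the job JSON or under a "properties" key; check your own job file for the actual layout.

```python
import json

# "stdouterr" is the id for the job's stdout/stderr streams, per the reply above.
STDOUTERR_ID = "stdouterr"


def valid_directory_ids(job_definition):
    """Return the ids that may be passed to -d for this job definition.

    Hypothetical helper: assumes "outputDirectories" is found either at the
    top level of the job JSON or nested under "properties".
    """
    section = job_definition.get("properties") or job_definition
    directories = section.get("outputDirectories", [])
    # Keep only entries that actually declare an "id".
    return [STDOUTERR_ID] + [d["id"] for d in directories if "id" in d]


if __name__ == "__main__":
    # Job fragment taken from the example in this thread, wrapped in an
    # assumed "properties" object.
    job = json.loads("""
    {
      "properties": {
        "outputDirectories": [{
          "id": "MODEL",
          "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
          "pathSuffix": "Models"
        }]
      }
    }
    """)
    print(valid_directory_ids(job))
```

With the fragment from this thread, the list contains "stdouterr" and "MODEL", so either `-d stdouterr` or `-d MODEL` would be a valid choice, while `-d tensorflow_samples` would not.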
Thanks for the response. I eventually accessed the files through the Azure portal, but I will try specifying the correct directory id in the future.