microsoftarchive/BatchAI

Recipes don't include information about accessing results

mhauskn opened this issue · 4 comments

It seems the recipes end with submitting the job. However, having successfully followed a recipe to completion, it would be nice to know how to access the outputs from the job.

Having run the TensorFlow example, I attempted to access the results and got the following error:

matthew@cantor:~/BatchAI/recipes/TensorFlow/TensorFlow-GPU$ az batchai job list-files --name tensorflow -d tensorflow_samples
Error occurred in request., RetryError: HTTPSConnectionPool(host='management.azure.com', port=443): Max retries exceeded with url: /subscriptions/6ad709f4-8451-47eb-b4aa-24733abf60e4/resourceGroups/batchaitests/providers/Microsoft.BatchAI/jobs/tensorflow/listOutputFiles?api-version=2017-09-01-preview&outputdirectoryid=tensorflow_samples&linkexpiryinminutes=60&maxresults=1000 (Caused by ResponseError('too many 500 error responses',))
Traceback (most recent call last):
  File "/opt/az/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/opt/az/lib/python3.6/site-packages/urllib3/connectionpool.py", line 732, in urlopen
    body_pos=body_pos, **response_kw)
  File "/opt/az/lib/python3.6/site-packages/urllib3/connectionpool.py", line 732, in urlopen
    body_pos=body_pos, **response_kw)
  File "/opt/az/lib/python3.6/site-packages/urllib3/connectionpool.py", line 732, in urlopen
    body_pos=body_pos, **response_kw)
  File "/opt/az/lib/python3.6/site-packages/urllib3/connectionpool.py", line 712, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "/opt/az/lib/python3.6/site-packages/urllib3/util/retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='management.azure.com', port=443): Max retries exceeded with url: /subscriptions/6ad709f4-8451-47eb-b4aa-24733abf60e4/resourceGroups/batchaitests/providers/Microsoft.BatchAI/jobs/tensorflow/listOutputFiles?api-version=2017-09-01-preview&outputdirectoryid=tensorflow_samples&linkexpiryinminutes=60&maxresults=1000 (Caused by ResponseError('too many 500 error responses',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/az/lib/python3.6/site-packages/msrest/service_client.py", line 194, in send
    **kwargs)
  File "/opt/az/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/az/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/opt/az/lib/python3.6/site-packages/requests/adapters.py", line 499, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='management.azure.com', port=443): Max retries exceeded with url: /subscriptions/6ad709f4-8451-47eb-b4aa-24733abf60e4/resourceGroups/batchaitests/providers/Microsoft.BatchAI/jobs/tensorflow/listOutputFiles?api-version=2017-09-01-preview&outputdirectoryid=tensorflow_samples&linkexpiryinminutes=60&maxresults=1000 (Caused by ResponseError('too many 500 error responses',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/az/lib/python3.6/site-packages/azure/cli/main.py", line 36, in main
    cmd_result = APPLICATION.execute(args)
  File "/opt/az/lib/python3.6/site-packages/azure/cli/core/application.py", line 212, in execute
    result = expanded_arg.func(params)
  File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 377, in __call__
    return self.handler(*args, **kwargs)
  File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 630, in _execute_command
    raise client_exception
  File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 620, in _execute_command
    reraise(*sys.exc_info())
  File "/opt/az/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/opt/az/lib/python3.6/site-packages/azure/cli/core/commands/__init__.py", line 602, in _execute_command
    result = op(client, **kwargs) if client else op(**kwargs)
  File "/opt/az/lib/python3.6/site-packages/azure/cli/command_modules/batchai/custom.py", line 332, in list_files
    return list(client.list_output_files(resource_group, job_name, options))
  File "/opt/az/lib/python3.6/site-packages/msrest/paging.py", line 109, in __next__
    self.advance_page()
  File "/opt/az/lib/python3.6/site-packages/msrest/paging.py", line 95, in advance_page
    self._response = self._get_next(self.next_link)
  File "/opt/az/lib/python3.6/site-packages/azure/mgmt/batchai/operations/jobs_operations.py", line 698, in internal_paging
    request, header_parameters, **operation_config)
  File "/opt/az/lib/python3.6/site-packages/msrest/service_client.py", line 220, in send
    raise_with_traceback(ClientRequestError, msg, err)
  File "/opt/az/lib/python3.6/site-packages/msrest/exceptions.py", line 45, in raise_with_traceback
    raise error.with_traceback(exc_traceback)
  File "/opt/az/lib/python3.6/site-packages/msrest/service_client.py", line 194, in send
    **kwargs)
  File "/opt/az/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/az/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/opt/az/lib/python3.6/site-packages/requests/adapters.py", line 499, in send
    raise RetryError(e, request=request)
msrest.exceptions.ClientRequestError: Error occurred in request., RetryError: HTTPSConnectionPool(host='management.azure.com', port=443): Max retries exceeded with url: /subscriptions/6ad709f4-8451-47eb-b4aa-24733abf60e4/resourceGroups/batchaitests/providers/Microsoft.BatchAI/jobs/tensorflow/listOutputFiles?api-version=2017-09-01-preview&outputdirectoryid=tensorflow_samples&linkexpiryinminutes=60&maxresults=1000 (Caused by ResponseError('too many 500 error responses',))

Thank you for reporting the issue. We will investigate and resolve it shortly.

We will fix the error reporting. The issue is that you specified the wrong directory id in the -d parameter. The directory id is either "stdouterr", for the standard output and error streams, or the value of "id" in one of the job's "outputDirectories" definitions.
e.g.
"outputDirectories": [{
    "id": "MODEL",
    "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
    "pathSuffix": "Models"
}],

With this definition, you need to pass "-d MODEL".

Thanks,
Alex
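
For anyone hitting the same error, here is a minimal sketch of the rule described in the reply above: collect the directory ids a job accepts ("stdouterr" plus each "id" under "outputDirectories") and check the value before passing it to -d. The helper names (`valid_directory_ids`, `list_files_command`) and the sample job definition are hypothetical, and nesting "outputDirectories" under "properties" is an assumption about the job file layout. The command and flags are the ones used in the original report; the script only builds the command line and does not call Azure.

```python
import json

# Hypothetical job definition modeled on the fragment quoted in the reply.
# Nesting "outputDirectories" under "properties" is an assumption.
JOB_JSON = """
{
  "properties": {
    "outputDirectories": [{
      "id": "MODEL",
      "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
      "pathSuffix": "Models"
    }]
  }
}
"""


def valid_directory_ids(job):
    """Return "stdouterr" plus every "id" listed under "outputDirectories"."""
    # Accept the key either at the top level or under "properties".
    section = job.get("properties", job)
    ids = ["stdouterr"]
    for entry in section.get("outputDirectories", []):
        ids.append(entry["id"])
    return ids


def list_files_command(job_name, directory_id, job):
    """Build the list-files command, rejecting ids the job does not define."""
    allowed = valid_directory_ids(job)
    if directory_id not in allowed:
        raise ValueError(
            "unknown directory id %r; expected one of %s" % (directory_id, allowed)
        )
    return ["az", "batchai", "job", "list-files",
            "--name", job_name, "-d", directory_id]


job = json.loads(JOB_JSON)
print(valid_directory_ids(job))
print(" ".join(list_files_command("tensorflow", "MODEL", job)))
```

With this sample definition, passing "tensorflow_samples" (the value from the original report) raises a ValueError locally instead of surfacing as repeated 500 responses from the service.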

Thanks for the response. I eventually accessed the files through the Azure portal, but I will specify the correct directory id in the future.