microsoft/MLOS

Debug logging hiding real Azure Compute API errors in AzureDeploymentService

Opened this issue · 2 comments

In _check_operation_status() of AzureDeploymentService, debug logging relying on a JSON response from the management operation should be done after checking for response status code is 200.

image

In this case, we have been having cryptic JSON decoding errors that are hard to reproduce (trying to decode empty JSON). However looking more carefully into the stack trace, we can see that a 503 response caused this issue. 503 is a Service Unavailable issue from the compute-side API so that should be handled separately, but it would be good to surface the real error instead of JSON decoding issue.

2024-07-24 07:10:42,882 azure_deployment_services.py:238 _check_operation_status DEBUG Response: <Response [200]>
{
  "startTime": "2024-07-24T06:52:41.3832595+00:00",
  "status": "InProgress",
  "name": "ced88d6f-259f-4fba-8636-376158009bba"
}
2024-07-24 07:10:42,883 azure_deployment_services.py:316 _wait_while DEBUG Sleep for: 6.11 of 20.00 s
2024-07-24 07:10:48,999 azure_auth.py:119 get_access_token DEBUG Time to renew the token: 940.00 sec.
2024-07-24 07:10:49,003 connectionpool.py:1052 _new_conn DEBUG Starting new HTTPS connection (1): management.azure.com:443
2024-07-24 07:12:40,005 connectionpool.py:546 _make_request DEBUG https://management.azure.com:443 "GET /subscriptions/<REDACTED>/providers/Microsoft.Compute/locations/eastus2/operations/<REDACTED> HTTP/1.1" 503 988
2024-07-24 07:12:40,081 base_scheduler.py:132 __exit__ WARNING Scheduler END :: SyncScheduler
Traceback (most recent call last):
  File "/opt/conda/envs/mlos/lib/python3.11/site-packages/requests/models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlos/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlos/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlos/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/run.py", line 29, in _main
    scheduler_context.start()
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/schedulers/sync_scheduler.py", line 39, in start
    self._run_schedule(is_warm_up)
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/schedulers/base_scheduler.py", line 243, in _run_schedule
    self.run_trial(trial)
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/schedulers/sync_scheduler.py", line 56, in run_trial
    (status, timestamp, results) = self.environment.run()  # Block and wait for the final result.
                                   ^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/environments/composite_env.py", line 204, in run
    (status, timestamp, metrics) = env_context.run()
                                   ^^^^^^^^^^^^^^^^^
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/environments/composite_env.py", line 204, in run
    (status, timestamp, metrics) = env_context.run()
                                   ^^^^^^^^^^^^^^^^^
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/environments/composite_env.py", line 204, in run
    (status, timestamp, metrics) = env_context.run()
                                   ^^^^^^^^^^^^^^^^^
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/environments/remote/remote_env.py", line 139, in run
    (status, timestamp, output) = self._remote_exec(self._script_run)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/environments/remote/remote_env.py", line 176, in _remote_exec
    (status, output) = self._remote_exec_service.get_remote_exec_results(output)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/services/remote/azure/azure_vm_services.py", line 512, in get_remote_exec_results
    (status, result) = self.wait_host_operation(config)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/services/remote/azure/azure_vm_services.py", line 217, in wait_host_operation
    return self._wait_while(self._check_operation_status, Status.RUNNING, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/services/remote/azure/azure_deployment_services.py", line 319, in _wait_while
    (status, output) = func(params)
                       ^^^^^^^^^^^^
  File "/workspaces/MySQL-Autotuning/MLOS/mlos_bench/mlos_bench/services/remote/azure/azure_deployment_services.py", line 239, in _check_operation_status
    json.dumps(response.json(), indent=2)
               ^^^^^^^^^^^^^^^
  File "/opt/conda/envs/mlos/lib/python3.11/site-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

@eujing do you just want to relax the required logging level from DEBUG to INFO here? Or is there something else you're proposing?

@bpkroth I have some changes proposed, which are mainly to:

  1. Add 503 to a list of extra response codes to retry on
  2. Do some exception handling on the debug logging formatting so we dont hide valid bad responses.

Will add a PR soon with the changes I have been using for long running experiments!