Bug: supervise seems to crash on backend error.

Question

Closed this issue 7 years ago · 5 comments

After successfully kicking off multiple workflows, supervise failed with the following stacktrace:

Traceback (most recent call last):
  File "/usr/local/bin/fissfc", line 11, in <module>
    sys.exit(main())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/fiss.py", line 2074, in main
    result = args.func(args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/fiss.py", line 996, in supervise
    recovery_file, api_url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/supervisor.py", line 44, in supervise
    supervise_until_complete(monitor_data, dependencies, args, recovery_file)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/supervisor.py", line 231, in supervise_until_complete
    api_root=api_url
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/api.py", line 1016, in create_submission
    return __post(uri, json=body)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/api.py", line 55, in __post
    headers = _fiss_access_headers({"Content-type":  "application/json"})
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/api.py", line 38, in _fiss_access_headers
    access_token = credentials.get_access_token().access_token
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/oauth2client/client.py", line 663, in get_access_token
    self.refresh(http)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/oauth2client/client.py", line 545, in refresh
    self._refresh(http)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/oauth2client/client.py", line 749, in _refresh
    self._do_refresh_request(http)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/oauth2client/client.py", line 819, in _do_refresh_request
    raise HttpAccessTokenRefreshError(error_msg, status=resp.status)
oauth2client.client.HttpAccessTokenRefreshError: internal_failure: Backend Error

Answer 1 · 2017-06-09T13:55:40.000Z

Because our typical supervised workflows are expected to run for a while, shouldn't we expect the authentication credentials to expire at some point (and yield something like the HttpAccessTokenRefreshError seen here)?

Other thoughts out loud:

How long did it take for this error to occur?
And do the FireCloud (or Google) docs indicate how long a token should persist? If we know this we can code into the supervisor an automatic refresh of the token.

Answer 2 · 2017-06-09T13:58:07.000Z

It took minutes to occur.

My thought is that we should be more robust to backend errors - either via a set number of retries, or recognizing specific ones that we can code a solution to.
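To make the "set number of retries" idea concrete, here is a minimal sketch, not the actual FISS fix. `TransientBackendError` and `call_with_retries` are hypothetical names invented for illustration; in the supervisor, the retryable exception would presumably be something like the `HttpAccessTokenRefreshError` in the traceback above, and `func` would wrap the submission call.

```python
import time


class TransientBackendError(Exception):
    """Hypothetical stand-in for an error we believe is safe to retry."""


def call_with_retries(func, retryable=(TransientBackendError,),
                      max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on `retryable` errors up to max_attempts times.

    Any exception type not listed in `retryable` propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Only the listed exception types are retried, so an unrecognized failure still stops the run right away instead of being masked.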

Answer 3 · 2017-06-09T15:40:45.000Z

Agreed, in principle. But we'd need to be able to identify all backend errors as such, and also discern on a case-by-case basis which of those are recoverable. Not clear to me that a general solution exists, but I'd be happy to be wrong.

The particular manifestation here seems recoverable, or perhaps avoidable outright, by beginning a supervise session with credential establishment ... then periodically refreshing.
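A rough sketch of what "establish credentials up front, then refresh periodically" could look like. Everything here is an assumption for illustration: `RefreshingToken` is a made-up helper, and `fetch` is a hypothetical callable returning a `(token, lifetime_seconds)` pair, not a real FISS or oauth2client function.

```python
import time


class RefreshingToken:
    """Cache an access token and renew it shortly before it expires."""

    def __init__(self, fetch, margin=300.0, clock=time.monotonic):
        self._fetch = fetch      # hypothetical: returns (token, lifetime_seconds)
        self._margin = margin    # renew this many seconds before expiry
        self._clock = clock      # injectable so the logic can be tested
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a token, fetching a new one if the cached one is near expiry."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token, lifetime = self._fetch()
            self._expires_at = now + lifetime
        return self._token
```

Calling `get()` once at the start of a supervise session and again before each API request would cover both the "establish" and the "periodically refresh" parts. It would not help if the refresh call itself fails with a backend error, so it complements retrying rather than replacing it.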

Answer 4 · 2017-06-09T15:55:44.000Z

Also: how reproducible is this--i.e. how often does it happen? Can you provide more details on exactly what was typed, and how many workflows were kicked off before the failure was seen?

Answer 5 · 2017-08-09T13:54:37.000Z

Fixed by #45