Bug: supervise seems to crash on backend error.
Closed this issue · 5 comments
After successfully kicking off multiple workflows, supervise failed with the following stacktrace:
Traceback (most recent call last):
  File "/usr/local/bin/fissfc", line 11, in <module>
    sys.exit(main())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/fiss.py", line 2074, in main
    result = args.func(args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/fiss.py", line 996, in supervise
    recovery_file, api_url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/supervisor.py", line 44, in supervise
    supervise_until_complete(monitor_data, dependencies, args, recovery_file)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/supervisor.py", line 231, in supervise_until_complete
    api_root=api_url
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/api.py", line 1016, in create_submission
    return __post(uri, json=body)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/api.py", line 55, in __post
    headers = _fiss_access_headers({"Content-type": "application/json"})
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/api.py", line 38, in _fiss_access_headers
    access_token = credentials.get_access_token().access_token
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/oauth2client/client.py", line 663, in get_access_token
    self.refresh(http)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/oauth2client/client.py", line 545, in refresh
    self._refresh(http)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/oauth2client/client.py", line 749, in _refresh
    self._do_refresh_request(http)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/oauth2client/client.py", line 819, in _do_refresh_request
    raise HttpAccessTokenRefreshError(error_msg, status=resp.status)
oauth2client.client.HttpAccessTokenRefreshError: internal_failure: Backend Error
Because our typical supervised workflows are expected to run for a while, shouldn't we expect the authentication credentials to expire (and yield something like the HttpAccessTokenRefreshError seen here)?
Other thoughts out loud:
How long did it take for this error to occur?
And do the FireCloud (or Google) docs indicate how long a token should persist? If we know this we can code into the supervisor an automatic refresh of the token.
It took minutes to occur.
My thought is that we should be more robust to backend errors, either via a set number of retries or by recognizing specific errors that we can code a solution for.
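To make the "set number of retries" idea concrete, here is a minimal sketch of a bounded retry with exponential backoff. It is not existing FISS code: `call_with_retries` and `TransientBackendError` are hypothetical names, and the attempt count and delays are arbitrary. In practice `retry_on` could be set to whichever exceptions are judged recoverable, for example the `HttpAccessTokenRefreshError` in the traceback above.

```python
import time


class TransientBackendError(Exception):
    """Hypothetical stand-in for an error we consider safe to retry."""


def call_with_retries(func, max_attempts=3, base_delay=1.0,
                      retry_on=(TransientBackendError,), sleep=time.sleep):
    """Call func(); on a retryable error, wait and try again.

    The delay doubles after each failed attempt. The last failure is
    re-raised so the caller still sees the original exception.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Errors not listed in `retry_on` propagate immediately, so only failures that have been explicitly identified as recoverable are retried.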
Agreed, in principle. But we'd need to be able to identify all backend errors as such, and also discern on a case-by-case basis which of those are recoverable. Not clear to me that a general solution exists, but I'd be happy to be wrong.
The particular manifestation here seems recoverable, or perhaps avoidable outright, by beginning a supervise session with credential establishment ... then periodically refreshing.
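A rough sketch of the "establish credentials, then refresh periodically" approach might look like the following. Everything here is an assumption for illustration: `RefreshingToken` is a hypothetical helper, `fetch_token` is any caller-supplied function returning a `(token, expires_in_seconds)` pair, and the 300-second safety margin is arbitrary. It does not reflect how FISS or oauth2client manage tokens internally.

```python
import time


class RefreshingToken(object):
    """Cache a token and fetch a new one shortly before it expires."""

    def __init__(self, fetch_token, margin=300, clock=time.time):
        self._fetch = fetch_token   # hypothetical: returns (token, expires_in_seconds)
        self._margin = margin       # refresh this many seconds before expiry
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a token, refreshing it if missing or close to expiry."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            token, expires_in = self._fetch()
            self._token = token
            self._expires_at = now + expires_in
        return self._token
```

The supervisor loop would call `get()` before each API request, so a long-running session never relies on a token that is about to expire. A refresh that itself hits a backend error would still raise, so this complements a retry policy rather than replacing it.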
Also: how reproducible is this--i.e. how often does it happen? Can you provide more details on exactly what was typed, and how many workflows were kicked off before the failure was seen?