broadinstitute/fiss

Bug: supervise seems to crash on backend error.

Closed this issue · 5 comments

After successfully kicking off multiple workflows, supervise failed with the following stacktrace:

Traceback (most recent call last):
  File "/usr/local/bin/fissfc", line 11, in <module>
    sys.exit(main())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/fiss.py", line 2074, in main
    result = args.func(args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/fiss.py", line 996, in supervise
    recovery_file, api_url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/supervisor.py", line 44, in supervise
    supervise_until_complete(monitor_data, dependencies, args, recovery_file)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/supervisor.py", line 231, in supervise_until_complete
    api_root=api_url
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/api.py", line 1016, in create_submission
    return __post(uri, json=body)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/api.py", line 55, in __post
    headers = _fiss_access_headers({"Content-type":  "application/json"})
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/firecloud/api.py", line 38, in _fiss_access_headers
    access_token = credentials.get_access_token().access_token
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/oauth2client/client.py", line 663, in get_access_token
    self.refresh(http)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/oauth2client/client.py", line 545, in refresh
    self._refresh(http)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/oauth2client/client.py", line 749, in _refresh
    self._do_refresh_request(http)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/oauth2client/client.py", line 819, in _do_refresh_request
    raise HttpAccessTokenRefreshError(error_msg, status=resp.status)
oauth2client.client.HttpAccessTokenRefreshError: internal_failure: Backend Error

Because our typical supervised workflows are expected to run for a while, shouldn't we expect the authentication credentials to expire (and yield something like the HttpAccessTokenRefreshError seen here)?

Other thoughts out loud:

How long did it take for this error to occur?
And do the FireCloud (or Google) docs indicate how long a token should persist? If we know this, we can code an automatic token refresh into the supervisor.

It took minutes to occur.

My thought is that we should be more robust to backend errors - either via a set number of retries, or by recognizing specific errors that we can code a solution for.
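
For illustration, here is a minimal sketch of the "set number of retries" idea. It is not the fiss implementation: `call_with_retries` and `TransientBackendError` are hypothetical names, and the exception class stands in for whichever errors are judged recoverable (for example the `HttpAccessTokenRefreshError` in the traceback above). The attempt count and backoff delays are assumed values.

```python
import time


class TransientBackendError(Exception):
    """Hypothetical stand-in for an error judged recoverable, such as
    oauth2client.client.HttpAccessTokenRefreshError."""


def call_with_retries(func, retriable=(TransientBackendError,),
                      max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a retriable exception, back off and try again.

    Re-raises the last exception once max_attempts is exhausted, and
    lets any non-retriable exception propagate immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retriable:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Under these assumptions, the supervisor could wrap the submission call, e.g. `call_with_retries(lambda: submit(...))`, where `submit` is a placeholder for the real API call. Only exception types listed in `retriable` are retried, so anything unrecognized still fails loudly.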

Agreed, in principle. But we'd need to be able to identify all backend errors as such, and also discern on a case-by-case basis which of those are recoverable. Not clear to me that a general solution exists, but I'd be happy to be wrong.

The particular manifestation here seems recoverable, or perhaps avoidable outright, by beginning a supervise session with credential establishment ... then periodically refreshing.
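
A rough sketch of that "establish up front, then refresh periodically" approach follows. The `TokenKeeper` class, the `fetch_token` callable, and the 30-minute default are all assumptions made for illustration. In fiss, `fetch_token` would presumably wrap the `credentials.get_access_token()` call seen in the traceback, and the right interval depends on the actual token lifetime.

```python
import time


class TokenKeeper(object):
    """Caches an access token and refreshes it once it is older than max_age."""

    def __init__(self, fetch_token, max_age=1800.0, clock=time.time):
        self._fetch_token = fetch_token  # hypothetical callable returning a token string
        self._max_age = max_age          # assumed refresh interval, in seconds
        self._clock = clock              # injectable so the logic can be tested
        self._token = None
        self._fetched_at = None
        self.refresh()                   # establish credentials up front, fail fast

    def refresh(self):
        """Fetch a fresh token and record when it was obtained."""
        self._token = self._fetch_token()
        self._fetched_at = self._clock()

    def token(self):
        """Return the cached token, refreshing first if it has aged out."""
        if self._clock() - self._fetched_at >= self._max_age:
            self.refresh()
        return self._token
```

Creating the keeper at the start of a supervise session surfaces credential problems before any workflow is launched. Each later API call would then ask `keeper.token()` rather than refreshing on its own schedule.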

Also: how reproducible is this--i.e. how often does it happen? Can you provide more details on exactly what was typed, and how many workflows were kicked off before the failure was seen?

Fixed by #45