Service retry logic incorrectly swallows errors and retries when it shouldn't
Opened this issue · 1 comments
@the logs below illustrate a failing smoketest due to being run in the wrong region, but they also highlight a more important problem: the retry logic (for services, at least) incorrectly swallows an AWS UnsupportedFeatureException
error and proceeds with retry logic when it should instead be failing immediately and returning that error. In other words, we only avoid retry logic for a whitelist of errors, when we should only be entering retry logic for a whitelist of errors. Failing immediately should be the default, not retrying.
Logs from service smoketests:
bats service.bats
✗ create
(in test file service.bats, line 10)
`l0 service create env_name svc_name_stateless1 dpl_name_stateless:latest' failed
ENVIRONMENT ID ENVIRONMENT NAME OS LINKS
envname360ae env_name linux
DEPLOY ID DEPLOY NAME VERSION COMPATIBILITIES
dplname54de5 dpl_name_stateful1 2 stateful
DEPLOY ID DEPLOY NAME VERSION COMPATIBILITIES
dplname10bf4 dpl_name_stateful2 2 stateful
DEPLOY ID DEPLOY NAME VERSION COMPATIBILITIES
dplnamef885d dpl_name_stateless 12 stateful
stateless
LOADBALANCER ID LOADBALANCER NAME TYPE ENVIRONMENT SERVICE PORTS PUBLIC URL
lbnamea28010 lb_name_alb application env_name 80:80/HTTP true l0-jlprefactortest-lbnamea28010-1930624962.us-west-2.elb.amazonaws.com
LOADBALANCER ID LOADBALANCER NAME TYPE ENVIRONMENT SERVICE PORTS PUBLIC URL
lbnamecdaff3 lb_name_clb classic env_name 80:80/HTTP true l0-jlprefactortest-lbnamecdaff3-1410965074.us-west-2.elb.amazonaws.com
2018/04/13 09:58:02 Maximum retry attempts reached
Logs from layer0 API:
2018/04/13 09:57:59 POST /service
2018/04/13 09:58:00 [DEBUG] an api error occured: ServerError (code='EventualConsistencyError') UnsupportedFeatureException: FARGATE is not supported in this region.
status code: 400, request id: cdfcd78e-3f3b-11e8-99b2-7f217f1e677c
2018/04/13 09:58:00 POST /service
2018/04/13 09:58:01 [DEBUG] an api error occured: ServerError (code='EventualConsistencyError') UnsupportedFeatureException: FARGATE is not supported in this region.
status code: 400, request id: cea9e083-3f3b-11e8-99b2-7f217f1e677c
2018/04/13 09:58:01 POST /service
2018/04/13 09:58:02 [DEBUG] an api error occured: ServerError (code='EventualConsistencyError') UnsupportedFeatureException: FARGATE is not supported in this region.
status code: 400, request id: cf58700a-3f3b-11e8-99b2-7f217f1e677c
From Slack
-
because of the nature of the retry, the "unable to assume role" error, for example, never gets returned to the outer
retry.Retry
-
another issue is maybe in RetryWithTimeout, since that creates its own error to stop the loop, and wouldn't return the AWS err
-
One idea is to make it the caller’s responsibility to track errors:
var err error
fn := func() bool {
err = foo()
if err, ok := err.(*awserr.Error); ok && err.Code == "..." {
return true
}
return false
}
retry(fn)
return err
that could work because it should retry again and err = foo()
should make err = nil
if foo()
succeeds. If it retries or times out, err
keeps the last error