quintilesims/layer0

Service retry logic incorrectly swallows errors and retries when it shouldn't


tlake commented

The logs below illustrate a smoketest failing because it was run in the wrong region, but they also highlight a more important problem: the retry logic (for services, at least) incorrectly swallows an AWS UnsupportedFeatureException error and proceeds to retry when it should instead fail immediately and return that error. In other words, we currently skip retry logic only for a whitelist of errors, when we should instead enter retry logic only for a whitelist of errors. Failing immediately should be the default, not retrying.
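A rough sketch of that inversion, using the aws-sdk-go awserr package; the shouldRetry helper and the specific whitelisted codes are assumptions for illustration, not existing layer0 code:

// shouldRetry returns true only for error codes known to be transient.
// Anything else, including UnsupportedFeatureException, fails immediately.
func shouldRetry(err error) bool {
    awsErr, ok := err.(awserr.Error)
    if !ok {
        return false
    }

    switch awsErr.Code() {
    case "Throttling", "ThrottlingException", "RequestLimitExceeded":
        return true
    default:
        return false
    }
}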

Logs from service smoketests:

bats service.bats
 ✗ create
   (in test file service.bats, line 10)
     `l0 service create env_name svc_name_stateless1 dpl_name_stateless:latest' failed
   ENVIRONMENT ID  ENVIRONMENT NAME  OS     LINKS
   envname360ae    env_name          linux  
   DEPLOY ID     DEPLOY NAME         VERSION  COMPATIBILITIES
   dplname54de5  dpl_name_stateful1  2        stateful
   DEPLOY ID     DEPLOY NAME         VERSION  COMPATIBILITIES
   dplname10bf4  dpl_name_stateful2  2        stateful
   DEPLOY ID     DEPLOY NAME         VERSION  COMPATIBILITIES
   dplnamef885d  dpl_name_stateless  12       stateful
                                              stateless
   LOADBALANCER ID  LOADBALANCER NAME  TYPE         ENVIRONMENT  SERVICE  PORTS       PUBLIC  URL
   lbnamea28010     lb_name_alb        application  env_name              80:80/HTTP  true    l0-jlprefactortest-lbnamea28010-1930624962.us-west-2.elb.amazonaws.com
   LOADBALANCER ID  LOADBALANCER NAME  TYPE     ENVIRONMENT  SERVICE  PORTS       PUBLIC  URL
   lbnamecdaff3     lb_name_clb        classic  env_name              80:80/HTTP  true    l0-jlprefactortest-lbnamecdaff3-1410965074.us-west-2.elb.amazonaws.com
   2018/04/13 09:58:02 Maximum retry attempts reached

Logs from layer0 API:

2018/04/13 09:57:59 POST /service 
2018/04/13 09:58:00 [DEBUG] an api error occured: ServerError (code='EventualConsistencyError') UnsupportedFeatureException: FARGATE is not supported in this region.
    status code: 400, request id: cdfcd78e-3f3b-11e8-99b2-7f217f1e677c
2018/04/13 09:58:00 POST /service 
2018/04/13 09:58:01 [DEBUG] an api error occured: ServerError (code='EventualConsistencyError') UnsupportedFeatureException: FARGATE is not supported in this region.
    status code: 400, request id: cea9e083-3f3b-11e8-99b2-7f217f1e677c
2018/04/13 09:58:01 POST /service 
2018/04/13 09:58:02 [DEBUG] an api error occured: ServerError (code='EventualConsistencyError') UnsupportedFeatureException: FARGATE is not supported in this region.
    status code: 400, request id: cf58700a-3f3b-11e8-99b2-7f217f1e677c

From Slack

  • Because of the nature of the retry, the "unable to assume role" error, for example, never gets returned to the outer retry.Retry.

  • Another possible issue is in RetryWithTimeout, since it creates its own error to stop the loop and so wouldn't return the AWS error.

  • One idea is to make it the caller’s responsibility to track errors:

var err error
fn := func() bool {
    err = foo()

    // awserr.Error is an interface, so assert without the pointer and call Code().
    // Returning true here means "retry"; only whitelisted codes should do that.
    if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == "..." {
        return true
    }

    // Success or a non-whitelisted error: stop retrying.
    return false
}

retry(fn)
return err

That could work: fn returns true only for a whitelisted code, so the loop retries again, and err = foo() resets err to nil once foo() succeeds. If the loop stops on a non-whitelisted error or times out, err still holds the last error for the caller to return.
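On the RetryWithTimeout point above, the same closure trick keeps the AWS error even when the helper stops the loop with its own timeout error. A minimal sketch, assuming a stand-in retryWithTimeout helper, a placeholder callAWS call, and a made-up whitelisted code (none of these are existing layer0 functions):

import (
    "errors"
    "time"

    "github.com/aws/aws-sdk-go/aws/awserr"
)

// callAWS stands in for whatever SDK call the handler actually makes.
var callAWS = func() error { return nil }

// retryWithTimeout calls fn until fn returns false ("no more retries needed")
// or the timeout elapses, in which case it returns its own sentinel error.
func retryWithTimeout(timeout, interval time.Duration, fn func() bool) error {
    deadline := time.Now().Add(timeout)
    for {
        if !fn() {
            return nil
        }
        if time.Now().After(deadline) {
            return errors.New("maximum retry attempts reached")
        }
        time.Sleep(interval)
    }
}

func createService() error {
    var lastErr error

    fn := func() bool {
        lastErr = callAWS()
        if lastErr == nil {
            return false // success: stop retrying
        }
        if awsErr, ok := lastErr.(awserr.Error); ok && awsErr.Code() == "Throttling" {
            return true // whitelisted transient code: retry
        }
        return false // e.g. UnsupportedFeatureException: fail immediately
    }

    err := retryWithTimeout(30*time.Second, time.Second, fn)
    if lastErr != nil {
        // Prefer the captured AWS error over the helper's generic timeout error.
        return lastErr
    }
    return err
}

This way the helper's "maximum retry attempts reached" error never masks the underlying AWS error, which is exactly what the smoketest output above loses today.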