youtube/doorman

[leader election] master shouldn't give up its leadership easily

BlueBlue-Lee opened this issue · 1 comments

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
feature request

What happened:
The doorman server performs leader election through etcd. The master sets a key with a TTL of delay and refreshes it every delay/3.

When the leader goes down, the etcd key expires and a new leader is elected.

See the source code:

	go func() {
		for {
			log.V(2).Infof("trying to become master at %v", e.lock)
			if _, err := kapi.Set(ctx, e.lock, id, &client.SetOptions{
				TTL:       e.delay,
				PrevExist: client.PrevNoExist,
			}); err != nil {
				log.V(2).Infof("failed becoming the master, retrying in %v: %v", e.delay, err)
				time.Sleep(e.delay)
				continue
			}
			e.isMaster <- true
			log.V(2).Infof("Became master at %v as %v.", e.lock, id)
			for {
				time.Sleep(e.delay / 3)
				log.V(2).Infof("Renewing mastership lease at %v as %v", e.lock, id)
				_, err := kapi.Set(ctx, e.lock, id, &client.SetOptions{
					TTL:       e.delay,
					PrevExist: client.PrevExist,
					PrevValue: id,
				})

				if err != nil {
					log.V(2).Info("lost mastership")
					e.isMaster <- false
					break
				}
			}
		}
	}()

When the master fails to renew the lease for a transient reason, for example network jitter, it gives up leadership immediately. If it simply tried again, the renewal would most likely succeed.

This results in an unnecessary switch to learning mode, which takes time to converge.

What you expected to happen or what your proposal is:

I think we should add a retry mechanism when renewing the lease. The master should give up its leadership only after the renewal fails twice, or some other configurable number of times.