PlayFab/consuldotnet

Bad request : Invalid wait time and multiple Get request while trying to acquire leader

Opened this issue · 2 comments

Hi,
We are using this package (version 0.7.2.6) for Leader election using distributed lock.
It is based on blog : https://codeblab.herokuapp.com/high-availability-and-leader-election-with-consul-c-part-1/
It works pretty fine (thanks for great work).

We have multiple servers with our application (which using playfab nuget) + with consul.io agent on the same server and additional cluster of consul.io servers for high availability
Anyways, once in a few days we getting exception on Non leader application:
Consul.ConsulRequestException: Unexpected response, status code BadRequest: Invalid wait time

Also,
I see in consul agent logs:
2019/03/12 04:54:08 [DEBUG] http: Request GET /v1/kv/service/scm/leader?index=934475&wait=1.999s (1.9967478s) from=127.0.0.1:56777
2019/03/12 04:54:08 [DEBUG] http: Request GET /v1/kv/service/scm/leader?index=934475&wait=ms (0s) from=127.0.0.1:56777

Usually we try to acquire lock every 2 seconds, so we see every to seconds GET log with wait=...s..
But when problem happens we see 2 GETs when second has wait=ms, which causes that exception.
I checked playfab code, public async Task Acquire(CancellationToken ct) method. but I didn't find good explanation why it happens? Some rare race condition?

10x in advance

It looks like the timeout there is getting formatted as ms, probably a bug in the code that converts a timespan to a go duration string. Check out https://github.com/PlayFab/consuldotnet/blob/master/Consul/Utilities/Extensions.cs if you want to take a crack at fixing it, but otherwise I'll look at it when I can.

Actually the best fix is in code:
class Lock
public async Task Acquire(CancellationToken ct)
Instead of if (elapsed > qOpts.WaitTime)
it should be if (elapsed >= qOpts.WaitTime)
actually it looks like the only place where qOpts.WaitTime is updated and could be 0.

So i would like to do pull request with that fix