tarantool/vshard

Add a backoff to `master_call()` and `router_call()` when master discovery fails multiple times

Gerold103 opened this issue · 6 comments

If master consistently reports garbage (for example, it says it is master, but refuses all write-mode requests), then the code shouldn't retry the calls right after each other. It might make sense to add a backoff timeout which grows with each weird error like that until some maximal value.

The alternative would be to remember the previous error and not report it when it happens again in a row. But a backoff timeout is probably better because it would reduce pressure on the faulty node.
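
As a rough illustration of the growing-timeout idea (a minimal sketch only; the 0.5 sec floor, the 5 sec cap and the next_backoff name are not existing vshard API, they are borrowed from the solution discussed later in the thread):

-- Grow the backoff after each consecutive "weird" error, up to a cap.
local MIN_BACKOFF = 0.5
local MAX_BACKOFF = 5

local function next_backoff(prev)
    if prev == nil then
        return MIN_BACKOFF
    end
    return math.min(prev * 2, MAX_BACKOFF)
end

-- next_backoff(nil) -> 0.5, then 1, 2, 4, 5, 5, ...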

I don't understand the motivation of this ticket and its roots. Is the problem that too many errors are written to the log? If so, how would a backoff reduce their number? master_call doesn't log any errors; it's mostly the services that constantly retry the call and log the error, and their behavior should be changed.

Add a backoff to master_call() and router_call()?

Should this be done inside master/router_call or inside replicaset? Should this backoff be persistent? It looks like it should (so, if during the first master_call the backoff reached 10 sec, an immediate second call of master_call with timeout 5 should not make a single request), in which case it's better to store it inside replicaset. But if the code is supposed to rely on returned messages, then that's unacceptable.

when master discovery fails multiple times

How will we distinguish master discovery failure from all other errors? You say that this is not a NON_MASTER error.

If master consistently reports garbage (for example, it says it is master, but refuses all write-mode requests), then the code shouldn't retry the calls right after each other. It might make sense to add a backoff timeout which grows with each weird error like that until some maximal value.

Do we want to back off on all returned errors now? Currently it's only special ones, defined in can_backoff_after_error(). Or do we want to find all this "garbage" and add it there (or inside storage/router)? In that case I don't understand how we would do it.
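
For context, a simplified sketch of the kind of filter can_backoff_after_error() applies today: back off only on errors that mean the instance cannot serve vshard functions at all, and only for vshard's own functions, where a retry is known to be safe. The authoritative list lives in vshard/replicaset.lua; the error codes below are just illustrative examples of that kind:

-- Simplified model, not the real implementation.
local function can_backoff_after_error(err, func)
    if type(err) ~= 'table' and type(err) ~= 'cdata' then
        return false
    end
    -- Retrying is considered safe only for vshard's own functions.
    if string.sub(func, 1, 7) ~= 'vshard.' then
        return false
    end
    -- E.g. the storage user is not created yet or vshard.storage.call is
    -- not defined yet - the node is simply not ready, try again later.
    return err.code == box.error.ACCESS_DENIED or
           err.code == box.error.NO_SUCH_PROC
end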


If we're trying to reduce load on masters, then we should probably implement backoff right inside replicaset_master_call, which should back off on every failed request. For now, there's no backoff for masters at all (maybe there's a reason for that?).

If we're trying to reduce the number of error messages, then we'd have to avoid logging an error when it happens several times in a row, everywhere it can happen, which I don't really like and I don't understand why it should be done.

Is the problem that too many errors are written to the log?

Yes.

master_call doesn't log any errors,

It does. It calls replicaset:update_master() which does the spam.

it's mostly the services that constantly retry the call and log the error, and their behavior should be changed.

Services already have backoffs; they have nothing to do with this ticket, from what I saw.

Should this be done inside master/router_call or inside replicaset?

There is no place to add it inside replicaset. The retrying is done on the layer above, in storage and router code.
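
To illustrate what "the layer above" means here, a hypothetical shape of such a caller-side retry loop, with the place where a backoff wait could be inserted (call_with_retry, do_call and backoff_deadline are made-up names for this sketch, not vshard API):

local fiber = require('fiber')

local function call_with_retry(replicaset, func, args, timeout)
    local deadline = fiber.clock() + timeout
    repeat
        -- A backoff check could live here: do not even send a request
        -- while the replicaset is still backing off.
        if replicaset.backoff_deadline ~= nil and
           fiber.clock() < replicaset.backoff_deadline then
            local wait = math.min(replicaset.backoff_deadline, deadline) -
                         fiber.clock()
            if wait > 0 then
                fiber.sleep(wait)
            end
        end
        local res, err = replicaset:do_call(func, args)
        if err == nil then
            return res
        end
        -- On an error the caller decides whether to retry, refresh the
        -- master, or arm/extend the backoff, then loops until the deadline.
        fiber.sleep(0.1) -- Small pause so a hot error does not busy-loop.
    until fiber.clock() >= deadline
    return nil, 'timeout'
end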

Should this backoff be persistent?

Doesn't have to be. Currently the problem appears within a single call.

How will we distinguish master discovery failure from all other errors?

The NON_MASTER error. But it doesn't have to be just this one. The router probably wouldn't need a backoff for the "bucket not found" error; that one is normal.

Do we want to back off on all returned errors now?

I honestly don't know. I haven't done any design for this ticket yet. I've only identified the issue and filed a ticket for it. I would appreciate it if, in case you want to take the ticket, you investigated the options here a bit more yourself.

If we're trying to reduce load on masters, then we should probably implement backoff right inside replicaset_master_call, which should back off on every failed request.

Could be done, right. I don't remember why it wasn't done for replicaset_master_call.

If we're trying to reduce load on masters, then we should probably implement backoff right inside replicaset_master_call, which should back off on every failed request.

Could be done, right. I don't remember why it wasn't done for replicaset_master_call.

I guess that a call to the master assumes that the request can possibly write data. If it is unclear whether the data was written (say, the network breaks after transmitting the request), we can't retry.

It doesn't mean that there are no error types that could be retried on the master.

I don't know whether this is the situation you're discussing; it's just a general thought.

Reproduced on https://github.com/ImeevMA/tarantool/tree/imeevma/gh-8862-vshard-auto-masters. It is caused by the router constantly retrying a call, because locate_master returns is_master_auto, which is true. Vshard reproducer:

diff --git a/test/router-luatest/router_test.lua b/test/router-luatest/router_test.lua
index 28a3685..dbe2dc8 100644
--- a/test/router-luatest/router_test.lua
+++ b/test/router-luatest/router_test.lua
@@ -64,7 +64,7 @@ g.before_all(function()
 end)
 
 g.after_all(function()
-    g.cluster:drop()
+    g.cluster:stop()
 end)
 
 g.test_basic = function(g)
@@ -645,3 +645,30 @@ g.test_router_service_info = function(g)
     -- Restore everything back.
     vtest.router_cfg(g.router, global_cfg)
 end
+
+g.test_router_auto_master = function(g)
+    local new_cfg_template = table.deepcopy(cfg_template)
+    for _, rs in pairs(new_cfg_template.sharding) do
+        rs.master = 'auto'
+        for _, r in pairs(rs.replicas) do
+            r.master = nil
+        end
+    end
+
+    local new_cluster_cfg = vtest.config_new(new_cfg_template)
+    vtest.router_cfg(g.router, new_cluster_cfg)
+    vtest.cluster_cfg(g, new_cluster_cfg)
+
+    -- Change master of the first replicaset.
+    g.replica_1_a:eval('box.cfg{read_only = true}')
+    g.replica_1_b:eval('box.cfg{read_only = false}')
+
+    local bid = vtest.storage_first_bucket(g.replica_1_b)
+    local res, err = g.router:exec(callrw_get_uuid, {bid})
+    t.assert_equals(err, nil, 'no error')
+    t.assert_equals(res, g.replica_1_b:instance_uuid(), 'b is master')
+
+    -- Restore everything
+    vtest.cluster_cfg(g, global_cfg)
+    vtest.router_cfg(g.router, global_cfg)
+end
diff --git a/vshard/router/init.lua b/vshard/router/init.lua
index 4457b4c..316118b 100644
--- a/vshard/router/init.lua
+++ b/vshard/router/init.lua
@@ -1183,7 +1183,9 @@ local function master_search_service_f(router, service)
         end
         local timeout
         local start_time = fiber_clock()
+        lfiber.testcancel()
         local is_done, is_nop, err = master_search_step(router)
+        lfiber.testcancel()
         if err then
             log.error(service:set_status_error(
                 'Error during master search: %s', lerror.make(err)))

It takes 5 seconds for the new master to notice its transition. This is probably caused by the router spamming calls without giving the fiber a chance to wake up and set the is_master flag, but maybe there are other reasons for that.

Solution: rework the backoff in replicaset.

  1. Make the backoff interval not constant but a growing timeout (changing the current behavior). Determine whether backoff is allowed according to can_backoff_after_error(). Min backoff: 0.5 sec, max: consts.REPLICA_BACKOFF_INTERVAL (5 sec). If the first call to the replica after a backoff fails, then new_backoff = old_backoff * 2 (see the sketch after this list).

  2. Add backoff to replicaset_master_call(). It will back off according to can_backoff_after_error() too.

  3. Make the replica back off when err.code == NON_MASTER and err.master_uuid == current_master:

    3.1 Simple way: add this check inside can_backoff_after_error(). But this makes replicaset rely on vshard errors, which is not good.

    3.2 Add a new replicaset API to put a replica into backoff. master_call and the router's call will manually put the replica into backoff. This solution is preferable.
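
A rough sketch of how points 1-3.2 could fit together on the replica object (the field and function names below are assumptions for illustration, not the final API; consts.REPLICA_BACKOFF_INTERVAL is the existing 5 sec constant):

local fiber = require('fiber')
local consts = require('vshard.consts')

local MIN_BACKOFF = 0.5

-- Point 3.2: explicit API to put a replica into backoff. master_call and
-- the router's call would invoke it manually, e.g. on NON_MASTER with
-- err.master_uuid equal to the currently known master.
local function replica_backoff(replica)
    local interval = replica.backoff_interval
    if interval == nil then
        interval = MIN_BACKOFF
    else
        -- Point 1: the previous backoff did not help (the first call after
        -- it failed again) - double the interval, up to the cap.
        interval = math.min(interval * 2, consts.REPLICA_BACKOFF_INTERVAL)
    end
    replica.backoff_interval = interval
    replica.backoff_until = fiber.clock() + interval
end

local function replica_is_backing_off(replica)
    return replica.backoff_until ~= nil and
           fiber.clock() < replica.backoff_until
end

local function replica_reset_backoff(replica)
    replica.backoff_interval = nil
    replica.backoff_until = nil
end

-- Point 2: replicaset_master_call would consult the same state:
--   skip or wait while replica_is_backing_off(master),
--   replica_reset_backoff(master) on success,
--   replica_backoff(master) on a can_backoff_after_error() error.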

@Gerold103, are you ok with such a solution?

Thanks for the investigation, it all sounds good. Yes, 3.2 is preferable for now (until we figure out how to make replicaset rely on vshard errors safely, or introduce a new internal API for it, meant to be used only with vshard's own functions, where we could rely on specific errors).