speculative_execution test flakiness
Lorak-mmk opened this issue · 2 comments
speculative_execution_is_fired
often fails in CI. We debugged it with @wprzytula . The problem is test assumes that each execution will be sent to different node which is no longer true in all cases after #944
Speculative execution uses load balancing Plan
, which wraps usages of LBP::pick and LBP::fallback. If some target in fallback is the same as the one from pick, it is filtered out by Plan
.
Before #944 Plan
filtered out nodes, not targets (nodes + shards), which with guarantee that fallback doesn't return the same node twice guaranteed that nodes in Plan
are unique.
Now those guarantees are no longer provided:
- For unprepared queries shards returned from LBP are random, so it's possible for
pick
to return node X with shard 0 and for node X to appear infallback
with node 1. This won't be filtered out byPlan
as those are different targets. - For prepared statements it's possible for the same node to appear in
fallback
multiple times. One case is with tablets - and that's intended. But it's possible even without tablets - first elements offallback
are replicas, and then other nodes in some order, with random shards. Those other nodes won't be filtered out because they are different targets (because of different nodes).
Solution after a long discussion with @wprzytula:
- For unprepared queries ignore the problem - it will be possible for the first node in
Plan
to appear later with a different shard number. This is just one repetition, no more, so it shouldn't be a problem, as those are unprepared queries (so the user probably doesn't care that much about performance of such query). - Alternative solution for unprepared queries that we didn't discuss is to always return
None
frompick
. Thenfallback
can take care of uniqness of nodes. WDYT @wprzytula ? - For prepared statements: don't eliminate repetitions in the first part of plan (that contains replicas). For the random part of the plan skip the nodes that appeared in the first part of the plan. This way there will be no repeating nodes in
Plan
except if 2 replicas happen to be on the same node.
* Alternative solution for unprepared queries that we didn't discuss is to always return `None` from `pick`. Then `fallback` can take care of uniqness of nodes. WDYT @wprzytula ?
Now that I think of it I'm really starting to like the idea. It will require some allocations in the unprepared case, but will provide consistent behavior in exchange.
I'm not in favour of additional allocations on the happy path of unprepared queries.