scylladb/scylla-rust-driver

speculative_execution test flakiness

Lorak-mmk opened this issue · 2 comments

speculative_execution_is_fired often fails in CI. We debugged it with @wprzytula . The problem is test assumes that each execution will be sent to different node which is no longer true in all cases after #944

Speculative execution uses load balancing Plan, which wraps usages of LBP::pick and LBP::fallback. If some target in fallback is the same as the one from pick, it is filtered out by Plan.
Before #944 Plan filtered out nodes, not targets (nodes + shards), which with guarantee that fallback doesn't return the same node twice guaranteed that nodes in Plan are unique.
Now those guarantees are no longer provided:

  • For unprepared queries shards returned from LBP are random, so it's possible for pick to return node X with shard 0 and for node X to appear in fallback with node 1. This won't be filtered out by Plan as those are different targets.
  • For prepared statements it's possible for the same node to appear in fallback multiple times. One case is with tablets - and that's intended. But it's possible even without tablets - first elements of fallback are replicas, and then other nodes in some order, with random shards. Those other nodes won't be filtered out because they are different targets (because of different nodes).

Solution after a long discussion with @wprzytula:

  • For unprepared queries ignore the problem - it will be possible for the first node in Plan to appear later with a different shard number. This is just one repetition, no more, so it shouldn't be a problem, as those are unprepared queries (so the user probably doesn't care that much about performance of such query).
  • Alternative solution for unprepared queries that we didn't discuss is to always return None from pick. Then fallback can take care of uniqness of nodes. WDYT @wprzytula ?
  • For prepared statements: don't eliminate repetitions in the first part of plan (that contains replicas). For the random part of the plan skip the nodes that appeared in the first part of the plan. This way there will be no repeating nodes in Plan except if 2 replicas happen to be on the same node.
* Alternative solution for unprepared queries that we didn't discuss is to always return `None` from `pick`. Then `fallback` can take care of uniqness of nodes. WDYT @wprzytula ?

Now that I think of it I'm really starting to like the idea. It will require some allocations in the unprepared case, but will provide consistent behavior in exchange.

I'm not in favour of additional allocations on the happy path of unprepared queries.