ChorusOne/solido

Reducing races between maintainer instances

ruuda opened this issue · 3 comments

ruuda commented

When there is maintenance to perform, all instances of the maintainer bot will submit a transaction to do the operation, but only one of those will succeed; the others lose the race and their transactions fail.

To prevent those races, we could assign every maintainer some slots in which they are the "leader", and if a maintainer is not a leader, it will not submit transactions. This can still lead to races at the boundaries, but it should greatly reduce the number of races we see.

We expect all maintainers to cooperate, so we can select the "leader" without coordination by mapping slots to maintainers. We can do something like

leader_index = (slot_number // 100) % num_maintainers

where // is integer division. Increasing the constant (100 above) will lead to fewer races, but also to longer delays until maintenance happens when a maintainer is offline. Decreasing the constant will improve responsiveness at the cost of more races. I think ~100 would be a good value: with the current 650 ms block times, that means the leader rotates about once per minute (100 × 0.65 s = 65 s).
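To make the mapping concrete, here is a minimal Python sketch of the formula above. The function names, the `my_index` parameter, and the idea that every maintainer knows its own position in a shared, identically ordered list are assumptions for illustration; the actual maintainer may organise this differently.

```python
SLOTS_PER_TURN = 100        # the constant from the formula above
SLOT_TIME_SECONDS = 0.65    # approximate block time mentioned above


def leader_index(slot_number: int, num_maintainers: int,
                 slots_per_turn: int = SLOTS_PER_TURN) -> int:
    """Return the index of the maintainer that is leader at `slot_number`."""
    if num_maintainers <= 0:
        raise ValueError("num_maintainers must be positive")
    return (slot_number // slots_per_turn) % num_maintainers


def is_leader(my_index: int, slot_number: int, num_maintainers: int,
              slots_per_turn: int = SLOTS_PER_TURN) -> bool:
    """True if the maintainer at `my_index` (hypothetical) should submit transactions."""
    return leader_index(slot_number, num_maintainers, slots_per_turn) == my_index


def rotation_period_seconds(slots_per_turn: int = SLOTS_PER_TURN,
                            slot_time_seconds: float = SLOT_TIME_SECONDS) -> float:
    """Approximate wall-clock time one maintainer stays leader."""
    return slots_per_turn * slot_time_seconds
```

With three maintainers, slots 0 to 99 map to maintainer 0, slots 100 to 199 to maintainer 1, slots 200 to 299 to maintainer 2, and slot 300 wraps back to maintainer 0. This only works if all maintainers agree on the ordering and on `num_maintainers`.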

That should decrease the races ^^
But we should aim for these maintainer functions to be permissionless in the future

ruuda commented

Permissionless maintainers will make it worse actually, because then you no longer know who else is participating, and you can’t coordinate to avoid the races. But I guess that’s the price of permissionlessness.

ruuda commented

I have a draft implementation for this in the maintainer-duty branch, and it will prevent maintainers from taking action when they are not “on duty”.

However, there is a bad interaction with the current polling setup: if a maintainer goes to sleep just before its duty starts, it might oversleep and never perform its duty. I haven’t worked out the math for how bad this is and what the expected latency is; I think it’s better to adjust the sleeping logic so the maintainer only sleeps until its turn begins; we can time it such that the maintainer wakes up at the right moment.
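A rough sketch of what “sleep only until it’s their turn” could look like, in Python for brevity. This is not the code in the maintainer-duty branch; the function names, the `poll_interval_seconds` parameter, and the fixed 0.65 s slot time are assumptions, and real slot times vary, so the wake-up would only be approximately on time.

```python
def slots_until_duty(my_index: int, current_slot: int, num_maintainers: int,
                     slots_per_turn: int = 100) -> int:
    """Slots to wait until `my_index` becomes leader; 0 if it is on duty now."""
    current_turn = current_slot // slots_per_turn
    current_leader = current_turn % num_maintainers
    # Number of whole turns until the rotation reaches us.
    turns_to_wait = (my_index - current_leader) % num_maintainers
    if turns_to_wait == 0:
        return 0
    next_duty_start = (current_turn + turns_to_wait) * slots_per_turn
    return next_duty_start - current_slot


def sleep_seconds(my_index: int, current_slot: int, num_maintainers: int,
                  poll_interval_seconds: float,
                  slots_per_turn: int = 100,
                  slot_time_seconds: float = 0.65) -> float:
    """How long to sleep before the next maintenance check (assumed policy)."""
    wait_slots = slots_until_duty(my_index, current_slot, num_maintainers,
                                  slots_per_turn)
    if wait_slots == 0:
        # On duty: keep the normal polling interval.
        return poll_interval_seconds
    # Off duty: sleep until our turn is estimated to begin.
    return wait_slots * slot_time_seconds
```

For example, with three maintainers at slot 50, maintainer 1 would wait 50 slots (about 32.5 s) and maintainer 2 would wait 150 slots, instead of sleeping a fixed interval that could span their entire turn.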

To make that a bit nicer to implement, I am refactoring the daemon loop a bit. While I’m at it I’ll tackle #420.