Risk of Validator double signing due to inconsistent / wrong behavior of `istanbul.start()` and `istanbul.stop()`
erNail opened this issue · 0 comments
Problem/Motivation
When running a Celo validator, you can use the commands `istanbul.start()` and `istanbul.stop()` to make the validator start or stop validating. This is useful if you want to run redundant validators: if one validator is not running as expected, you stop it from validating and make your backup validator start validating.
We might have found an issue with this approach. `istanbul.start()` and `istanbul.stop()` don't always work as expected. In our case, `istanbul.stop()` did not stop the validator from validating. This can cause double signing issues when you execute `istanbul.start()` on another validator.
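To illustrate the failover flow described above, here is a minimal sketch of a guarded switch that only starts the backup after the primary reports `"Replica"`. The function name `guardedFailover` and the `primary` / `backup` handles are hypothetical: they stand for objects exposing the same `start()`, `stop()` and `replicaState.state` members as the console's `istanbul` object on each node. Given the behavior reported in this issue, the reported state may itself be unreliable, so treat this as an extra precaution rather than a guarantee.

```javascript
// Sketch only: `primary` and `backup` are hypothetical handles that expose the
// same start(), stop() and replicaState members as the console's `istanbul`
// object on the respective node.
function guardedFailover(primary, backup) {
  try {
    primary.stop();
  } catch (err) {
    // stop() failed, so the primary may still be validating: do not start the backup.
    return { switched: false, reason: "stop() threw: " + err.message };
  }

  var state = primary.replicaState.state;
  if (state !== "Replica") {
    // stop() returned, but the node does not report "Replica": do not start the backup.
    return { switched: false, reason: "primary still reports " + state };
  }

  backup.start();
  return { switched: true, reason: "primary reports Replica" };
}
```

When the two nodes are only reachable through separate console sessions, the same checks can be performed by hand: run `istanbul.stop()` on the primary, read `istanbul.replicaState.state`, and run `istanbul.start()` on the backup only if the result is `"Replica"`.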
Expected Behavior
- When executing `istanbul.stop()`, a validator should stop validating. `istanbul.replicaState.state` should show `"Replica"`. When `istanbul.stop()` throws an error, the validator should still be validating. `istanbul.replicaState.state` should show `"Primary"`.
- When restarting a validator, the validator should be in the same replica state as before. If `istanbul.replicaState.state` showed `"Primary"` before restarting, it should show `"Primary"` after restarting.
- A validator should not change its replica state randomly.
Current Behavior
The following happened while trying to fix a validator that stopped validating. We are not sure why it stopped and the logs also didn't indicate an issue, so the first thing we did was restart the validator. This did not help, so we tried to activate our backup validator. We were able to see the following behavior:
- When executing `istanbul.stop()` on the broken validator, an error was thrown (sadly, we don't have the exact error message anymore). `istanbul.replicaState.state` was showing `"Primary"`.
- We restarted the validator once more. Now, `istanbul.replicaState.state` was showing `"Replica"`, even though we did not execute `istanbul.stop()` again.
- After some minutes, we checked `istanbul.replicaState.state` again. This time, it showed `"Primary"`, even though we did not execute `istanbul.start()`.
TL;DR: Our validator did not become a `"Replica"` when we executed `istanbul.stop()`. After a restart, however, it reported that it was a `"Replica"`. After some minutes, it became a `"Primary"` again.
Steps to reproduce
We were not able to reproduce this behavior. Our guess is that we have an edge case on our hands, caused by the validator no longer working correctly. If you were to somehow reproduce it, it would look something like this:
- Execute `istanbul.stop()` on a `"Primary"` validator that stopped working, receive an error, execute `istanbul.replicaState.state` and see the state `"Primary"`
- Restart the validator, execute `istanbul.replicaState.state` and see the state `"Replica"`
- Wait for a few minutes, execute `istanbul.replicaState.state` and see the state `"Primary"`
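Since the sequence above is hard to trigger on demand, it may help to record state readings over time and extract the changes afterwards. The following is only a sketch: `findTransitions` and the `{ time, state }` sample shape are hypothetical, with `state` standing for a value read from `istanbul.replicaState.state` at the given time.

```javascript
// Sketch only: `samples` is a hypothetical array of { time, state } readings,
// where `state` is the value of istanbul.replicaState.state at that moment.
// Returns one entry for every change between two consecutive readings.
function findTransitions(samples) {
  var transitions = [];
  for (var i = 1; i < samples.length; i++) {
    if (samples[i].state !== samples[i - 1].state) {
      transitions.push({
        at: samples[i].time,
        from: samples[i - 1].state,
        to: samples[i].state
      });
    }
  }
  return transitions;
}
```

Any transition that does not line up with a manual `istanbul.start()` or `istanbul.stop()` call (or with a restart) would be an instance of the unexpected behavior described in this issue, and its timestamp narrows down which log lines to look at.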
Additional information
We acknowledge that it might be impossible to debug or analyze this issue, since it can't be easily reproduced and more detailed information about the errors and logs is missing. We decided to create this issue anyway to make validators and developers aware that there might be a bug that can increase the risk of double signing. We will add additional information if we see this issue again.