celo-org/celo-blockchain

Risk of Validator double signing due to inconsistent / wrong behavior of `istanbul.start()` and `istanbul.stop()`

erNail opened this issue · 0 comments

Problem/Motivation

When running a Celo validator, you can use the commands istanbul.start() and istanbul.stop() to make the validator start or stop validating. This is useful if you want to run redundant validators. If one validator is not running as expected, you stop him from validating, and you make your backup validator start validating.

We might have found an issue with this approach. istanbul.start() and istanbul.stop() don't always work as expected. In our case, istanbul.stop() did not stop the validator from validating. This can cause double signing issues when you execute istanbul.start() on another validator.

Expected Behavior

  1. When executing istanbul.stop(), a validator should stop validating. istanbul.replicaState.state should show "Replica". When istanbul.stop() throws an error, the validator should still be validating. istanbul.replicaState.state should show "Primary"
  2. When restarting a validator, the validator should be in the same replica state as before. If istanbul.replicaState.state showed "Primary" before restarting, it should show "Primary" after restarting
  3. A validator should not change it's replica state randomly

Current Behavior

The following happened while trying to fix a validator that stopped validating. We are not sure why it stopped and the logs also didn't indicate an issue, so the first thing we did was restart the validator. This did not help, so we tried to activate our backup validator. We were able to see the following behavior:

  1. When executing istanbul.stop() on the broken validator, an error was thrown (Sadly we don't have the exact error message anymore). istanbul.replicaState.state was showing "Primary"
  2. We restarted the validator once more. Now, istanbul.replicaState.state was showing "Replica". Even though we did not execute istanbul.stop() again.
  3. After some minutes, we checked istanbul.replicaState.state again. This time, it showed "Primary". Even though we did not execute istanbul.start().

TL;DR: Our validator did not become a "Replica" when we executed istanbul.stop(). After a restart it showed that it is a "Replica" however. After some minutes, it became a "Primary" again.

Steps to reproduce

We were not able to reproduce this behavior. Our guess is that we have an edge case on our hands, caused by the validator stopping to work correctly. If you were to somehow reproduce it, it would look something like this:

  1. Execute istanbul.stop() on a "Primary" validator that stopped working, receive an error, execute istanbul.replicaState.state and see the state "Primary"
  2. Restart the validator, execute istanbul.replicaState.state and see the state "Replica"
  3. Wait for a few minutes, execute istanbul.replicaState.state and see the state "Primary"

Additional information

We acknowledge that it might be impossible to debug/analyze this issue, since it can't be easily reproduced and more detailed information about the errors and logs are missing. We decided to create this issue anyway to make the validators and developers aware that there might be a bug that can increase the risk of double signing. We will add additional information if we see this issue again.