gnosischain/posdao-test-setup

Make watchguard script to set the `engine_signer` on the second validator's node when the first node goes offline


Actually, we can't run two or more nodes for the same validator (with the same engine_signer) in AuRa: openethereum/parity-ethereum#10483

To have a second, reserve node for a validator, we can launch the second node without an engine_signer and write a watchguard script that detects when the first node goes offline. Right after that, the script would use the parity_setEngineSigner RPC call on the second node so that it continues producing blocks: openethereum/parity-ethereum#10483 (comment)
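A minimal sketch of the RPC call the watchguard would make, assuming the secondary node exposes JSON-RPC over HTTP on `127.0.0.1:8545` with the `parity_set` API enabled, and that the validator account is already present in that node's keystore. The URL and helper names are hypothetical; `parity_setEngineSigner` takes the signer address and the account password as its two parameters.

```python
import json
import urllib.request

# Hypothetical: the secondary node's RPC, reachable only locally.
SECONDARY_RPC_URL = "http://127.0.0.1:8545"


def build_rpc_payload(method, params, request_id=1):
    """Build a JSON-RPC 2.0 request body."""
    return {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}


def rpc_call(url, method, params, timeout=5.0):
    """POST a JSON-RPC request and return its "result" field."""
    body = json.dumps(build_rpc_payload(method, params)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    if "error" in reply:
        raise RuntimeError(f"{method} failed: {reply['error']}")
    return reply["result"]


def set_engine_signer(url, address, password):
    """Ask the node to start sealing blocks with the given validator account."""
    return rpc_call(url, "parity_setEngineSigner", [address, password])
```

The watchguard would call `set_engine_signer(SECONDARY_RPC_URL, validator_address, password)` once it decides the first node is down; the address and password would come from the validator's own configuration.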

If the first node comes back online, we need to remove the engine_signer from the second node immediately.
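The switching rule described above can be sketched as a small state machine fed by periodic health checks of the first node. The thresholds are hypothetical tuning knobs, and the "remove_signer" action is only a placeholder: the thread does not say which mechanism would clear the engine_signer on the second node.

```python
from dataclasses import dataclass


@dataclass
class Watchguard:
    """Decide when the secondary node should start or stop sealing.

    failure_threshold and recovery_threshold are hypothetical knobs:
    requiring consecutive failed checks avoids promoting the secondary
    because of a single missed health check.
    """
    failure_threshold: int = 3
    recovery_threshold: int = 1
    secondary_signing: bool = False
    _failures: int = 0
    _recoveries: int = 0

    def observe(self, primary_online: bool) -> str:
        """Record one health check of the primary and return the action to take."""
        if primary_online:
            self._failures = 0
            self._recoveries += 1
            if self.secondary_signing and self._recoveries >= self.recovery_threshold:
                self.secondary_signing = False
                return "remove_signer"  # placeholder: mechanism not specified
        else:
            self._recoveries = 0
            self._failures += 1
            if not self.secondary_signing and self._failures >= self.failure_threshold:
                self.secondary_signing = True
                return "set_signer"  # e.g. call parity_setEngineSigner
        return "none"
```

Keeping `recovery_threshold` at 1 reflects the "immediately" requirement: two nodes sealing with the same key is exactly what the linked issue warns against, so the secondary steps down on the first successful check of the primary.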

Let's remove nodes 4 and 5 from the test setup, leave node 6 without an engine_signer but with such a watchguard script, and test the case where node 3 goes offline and then comes back online.

With regard to the suggested AuRa solution, do we really need to check whether the signer has signed recent blocks? From the issue description, it seems we only need to check whether the primary node is online, and that can't be determined by looking at the blocks. Once we enable sealing on the secondary node, the same signer will sign blocks, so the blocks don't reveal which node, the primary or the secondary, sealed a given block: both carry the same block author.

I think we may need two watcher scripts, one per node: the first node's script would run on the first node's machine, and the second node's script on the second node's machine.

The first script would somehow verify that the first node is working properly. The second script would ask the first script whether the first node is OK (over an HTTP or socket connection).
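A minimal sketch of the two-script split over HTTP, using only the Python standard library. The `/health` path, the JSON shape, and the `check_primary` callable are hypothetical. How the second machine reaches the endpoint (for example through an SSH tunnel, if only port 22 is open) is also an assumption.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_health_handler(check_primary):
    """First script: build a handler that reports the primary's health as JSON."""
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            body = json.dumps({"ok": bool(check_primary())}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the watchguard's output quiet

    return HealthHandler


def serve_in_background(check_primary, host="127.0.0.1", port=0):
    """First script: serve the health endpoint on a daemon thread."""
    server = HTTPServer((host, port), make_health_handler(check_primary))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def primary_is_ok(health_url, timeout=3.0):
    """Second script: ask the first script whether the primary is OK.

    Connection errors, timeouts, HTTP errors and malformed replies all
    count as "not OK".
    """
    try:
        with urllib.request.urlopen(health_url, timeout=timeout) as response:
            return bool(json.loads(response.read().decode("utf-8")).get("ok"))
    except (OSError, ValueError):
        return False
```

Note that an unreachable first machine and an unhealthy first node both come back as `False` here. The second script cannot distinguish "the primary is down" from "the link between the two machines is down".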

Alternatively, there may be other ways to determine whether the node is working properly and is connected to the network: https://wiki.parity.io/JSONRPC-parity-module.html

I use isListening to check whether the node is online. I imagine there might even be a third machine that monitors the primary and secondary nodes.
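In web3.js, `isListening()` maps to the standard `net_listening` JSON-RPC method. A minimal local check might pair it with `net_peerCount`, since a node can be listening while having no peers. The `min_peers` threshold and the localhost URL are assumptions, and the check would have to run on the same machine as the node whose RPC it queries.

```python
import json
import urllib.request


def rpc_result(url, method, params=None, timeout=3.0):
    """POST a JSON-RPC 2.0 request and return its "result" field."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params or [], "id": 1}
    request = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]


def parse_peer_count(hex_quantity):
    """net_peerCount returns a hex-encoded quantity such as "0x2"."""
    return int(hex_quantity, 16)


def node_looks_healthy(listening, peer_count, min_peers=1):
    """Listening alone is not enough: also require some connected peers."""
    return bool(listening) and peer_count >= min_peers


def check_node(url="http://127.0.0.1:8545", min_peers=1):
    """Return True if the node answers and looks connected to the network."""
    try:
        listening = rpc_result(url, "net_listening")
        peers = parse_peer_count(rpc_result(url, "net_peerCount"))
    except (OSError, ValueError, RuntimeError):
        return False  # no answer or a bad answer counts as offline
    return node_looks_healthy(listening, peers, min_peers)
```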

For security reasons, it might be best to keep the validator node's RPC and other ports, except 22, closed to the outside world.

Then RPC calls must be made locally on the secondary node.

Right, I was wondering whether isListening would work in this case, because the secondary can't connect to the primary's RPC.

I imagine there might even be a third machine that monitors the primary and secondary nodes.

Three machines per validator is too much, so we should solve this with just two node servers communicating with each other.

Can you elaborate on the role of the script on the primary node? I think our setup only needs one script; imagine it physically running on the secondary machine. The script checks whether the primary node is running properly using isListening. Do we really need to split this function out of the single script, given that the test setup is local?

I think we should write a script that will also be used in production, not only for the test setup, right @varasev?


Yes, of course, this task is for production too.