renproject/darknode-cli

Allow darknode migration across cloud provider regions

bakebrain opened this issue · 2 comments

Summary

Allow migration of darknodes across cloud provider regions while staying with the currently selected provider. This is related to #22.

Basic example (aws)

darknode migrate YOUR-DARKNODE-NAME --source-aws-region sa-east-1 --target-aws-region eu-west-1
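If this lands, the subcommand would presumably be wired into the existing CLI. Below is a minimal sketch of what that wiring could look like, using the github.com/urfave/cli package that is common in Go CLIs. Whether darknode-cli actually uses this library is an assumption; the flag names are taken from the example above and are not an existing API.

```go
package main

import (
	"fmt"
	"os"

	"github.com/urfave/cli"
)

func main() {
	app := cli.NewApp()
	app.Name = "darknode"
	app.Commands = []cli.Command{{
		Name:  "migrate",
		Usage: "move a darknode to another region of the same cloud provider",
		Flags: []cli.Flag{
			cli.StringFlag{Name: "source-aws-region", Usage: "region the node currently runs in"},
			cli.StringFlag{Name: "target-aws-region", Usage: "region to move the node to"},
		},
		Action: func(ctx *cli.Context) error {
			name := ctx.Args().First()
			if name == "" {
				return fmt.Errorf("missing darknode name")
			}
			// The real work (snapshot, re-deploy, state copy) would go here.
			fmt.Printf("migrating %s: %s -> %s\n", name,
				ctx.String("source-aws-region"), ctx.String("target-aws-region"))
			return nil
		},
	}}
	if err := app.Run(os.Args); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```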

Motivation

  • allow an "internal" migration and avoid the lengthy deregister/destroy process, saving time, money, and fees
  • react to issues within the current region, e.g. extended downtimes or changes in instance type availability
  • potentially save cost, since pricing differs across regions

While this would be a great feature to have, it is very difficult to achieve without putting the node's operation at risk.

Option 1:

a. set up the new node, but keep it off (easy)
b. keep the new node up to date with all of the internal state of the old node (very hard)
c. turn off the old node and then turn on the new node without losing track of in-flight messages (potentially impossible)

Without achieving these steps reliably, the new node would be at risk of broadcasting information that was at odds with the old node. In the worst case, for example, it might end up proposing two different blocks at the same height/round, which would result in slashing.
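To make the equivocation risk in (1.b) concrete: BFT validators typically keep a record of the last height/round they signed at and refuse anything non-monotonic. The sketch below is illustrative, not darknode's actual code. Unless step (1.b) also replicates this record exactly and continuously, the new node's stale copy will permit a signature the old node already produced.

```go
package main

import "fmt"

// SignState records the last height and round this validator key signed at.
// The struct and method are illustrative, not darknode's implementation.
type SignState struct {
	Height int64
	Round  int32
}

// SafeToSign reports whether signing at (height, round) is monotonic with
// respect to everything already signed under this key.
func (s *SignState) SafeToSign(height int64, round int32) bool {
	if height != s.Height {
		return height > s.Height
	}
	return round > s.Round
}

func main() {
	oldNode := SignState{Height: 100, Round: 0} // old node already signed at 100/0
	newNode := SignState{}                      // cloned node with a stale/empty record

	// The old node refuses to sign at 100/0 again, but the new node's
	// out-of-date record happily permits it: two blocks, same height/round.
	fmt.Println(oldNode.SafeToSign(100, 0)) // false
	fmt.Println(newNode.SafeToSign(100, 0)) // true -> equivocation, slashable
}
```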

Option 2:

a. set up the new node, but keep it off (easy)
b. turn off the old node (easy)
c. download the state and upload it to the new node (easy, but slow)
d. turn on the new node (easy)

This solves the problem in (1) by effectively cloning the state from the old node. The clone has to be taken while the old node is off (and before the new node is on); otherwise the cloned state can become stale and we have the same problem as (1) again. But step (2.c) is slow and could result in extended downtime for the node, especially if there is an intermittent connectivity issue between the node operator's workspace and the node. The state can be several GBs, so downloading and then uploading the entire backup will take time. Then, the new node will be (at best) a few minutes behind the rest of the network and will need to re-synchronise with it.
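One way to shorten step (2.c) is to copy the state directly between the two instances instead of routing it through the operator's workspace, halving the transfer and letting it run at datacenter bandwidth. Here is a minimal sketch in Go, assuming SSH agent forwarding, a systemd user unit named darknode, and state under ~/.darknode; all of these names are assumptions for illustration, not darknode-cli's actual layout.

```go
package main

import (
	"fmt"
	"os/exec"
)

// run executes a command and surfaces its combined output on failure.
func run(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s: %v\n%s", name, err, out)
	}
	return nil
}

// migrate clones state from oldHost to newHost while both nodes are down.
func migrate(oldHost, newHost string) error {
	// (2.b) Stop the old node so its state stops changing.
	if err := run("ssh", oldHost, "systemctl --user stop darknode"); err != nil {
		return err
	}
	// (2.c) Push the state directly from the old instance to the new one,
	// skipping the operator's workspace. rsync's --partial lets the copy
	// resume after intermittent connectivity failures. The -A flag forwards
	// the SSH agent so the old host can authenticate to the new host.
	copyCmd := fmt.Sprintf("rsync -az --partial ~/.darknode/ %s:~/.darknode/", newHost)
	if err := run("ssh", "-A", oldHost, copyCmd); err != nil {
		return err
	}
	// (2.d) Only start the new node once the copy has fully completed.
	return run("ssh", newHost, "systemctl --user start darknode")
}

func main() {
	if err := migrate("old-node.example", "new-node.example"); err != nil {
		fmt.Println("migration failed:", err)
	}
}
```

Even with a direct copy, the node is down for the entire transfer, so this reduces, but does not remove, the risk described next.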

It is more than possible that this takes too long, and the node begins being slashed (and is forcibly deregistered) for not being online.

Decided not to label this as wontfix, and instead label it as help wanted. Open to suggestions here, but unless something compelling is suggested that minimises risk to the node operator, I am not sure migrations like this will actually be possible.