status-im/infra-nimbus

Deploy dedicated Geth instances for all Prater Nimbus nodes

jakubgs opened this issue · 23 comments

This has been neglected due to other priorities, like Mainnet nodes, but it's time to do a proper setup of one Geth node for each Nimbus node, as God intended. This will require quite a lot of hardware, as the Prater fleet involves 11 hosts and 33 nodes in total.

Since the current size of a snap-synced Geth node is about 160 GB:

admin@goerli-01.aws-eu-central-1a.nimbus.geth:~ % sudo du -hs /data/nimbus-goerli/node/data 
161G	/data/nimbus-goerli/node/data

We'll need at least 200 GB per node, with about 4 nodes on each host (4 x 200 GB = 800 GB), so a 1 TB NVMe should be sufficient for a while.
The most likely candidate is a Hetzner AX51-NVMe host with 2 x 1 TB NVMes.

Another possible option would be to replace existing hosts with bigger ones instead of adding separate hosts for Geth.

We could use Hetzner AX61-NVMe hosts, which have 2 x 1.92 TB NVMes: enough to run both the Geth and Nimbus nodes on the same host, simplifying setup, management, and debugging.

Based on a conversation with @zah I'm purchasing six AX61-NVMe hosts:

[image: Hetzner order for six AX61-NVMe hosts]

After the migration the leftover AX41-NVMe hosts will be reused for macOS and Windows Geth nodes, as well as CI.

I've provisioned the hosts: 1bdcf1ca

linux-01.he-eu-hel1.nimbus.prater hostname=linux-01.he-eu-hel1.nimbus.prater ansible_host=95.217.198.113 env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1 dns_entry=linux-01.he-eu-hel1.nimbus.prater.statusim.net
linux-02.he-eu-hel1.nimbus.prater hostname=linux-02.he-eu-hel1.nimbus.prater ansible_host=95.217.230.20 env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1 dns_entry=linux-02.he-eu-hel1.nimbus.prater.statusim.net
linux-03.he-eu-hel1.nimbus.prater hostname=linux-03.he-eu-hel1.nimbus.prater ansible_host=65.108.132.230 env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1 dns_entry=linux-03.he-eu-hel1.nimbus.prater.statusim.net
linux-04.he-eu-hel1.nimbus.prater hostname=linux-04.he-eu-hel1.nimbus.prater ansible_host=135.181.20.36 env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1 dns_entry=linux-04.he-eu-hel1.nimbus.prater.statusim.net
linux-05.he-eu-hel1.nimbus.prater hostname=linux-05.he-eu-hel1.nimbus.prater ansible_host=95.217.224.92 env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1 dns_entry=linux-05.he-eu-hel1.nimbus.prater.statusim.net
linux-06.he-eu-hel1.nimbus.prater hostname=linux-06.he-eu-hel1.nimbus.prater ansible_host=95.217.204.216 env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1 dns_entry=linux-06.he-eu-hel1.nimbus.prater.statusim.net

And started deploying Geth nodes on them.
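Each Geth node runs as an ethereum/client-go container; a minimal sketch of what such a deployment likely looks like (flags, ports, and volume layout are assumptions, not the actual Ansible-managed config):

# Hypothetical per-node container (each of the four nodes on a host would
# get its own name, volume, and ports)
docker run -d --name geth-goerli-01-node \
  -v /docker/geth-goerli-01/node:/data \
  -p 8545:8545 \
  ethereum/client-go:v1.10.23 \
  --goerli --datadir=/data \
  --http --http.addr=0.0.0.0 --http.port=8545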

So far they have not started syncing:

admin@linux-01.he-eu-hel1.nimbus.prater:/docker/geth-goerli-02 % d 
CONTAINER ID   NAMES                 IMAGE                         CREATED          STATUS
1ac7ddb275f1   geth-goerli-04-node   ethereum/client-go:v1.10.23   15 minutes ago   Up 15 minutes
28c7a83695b7   geth-goerli-03-node   ethereum/client-go:v1.10.23   16 minutes ago   Up 16 minutes
2f2aa5925bfb   geth-goerli-02-node   ethereum/client-go:v1.10.23   18 minutes ago   Up 18 minutes
5970e7efbbab   geth-goerli-01-node   ethereum/client-go:v1.10.23   19 minutes ago   Up 19 minutes

admin@linux-01.he-eu-hel1.nimbus.prater:/docker/geth-goerli-02 % /docker/geth-goerli-02/rpc.sh eth_syncing                      
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": false
}
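The rpc.sh helper here is presumably just a thin curl wrapper around each node's JSON-RPC endpoint; a minimal sketch (the port and argument handling are assumptions, not the actual script):

#!/usr/bin/env bash
# Sketch of an rpc.sh-style helper: POST a JSON-RPC call and print the reply.
METHOD="${1:?usage: rpc.sh <method> [params-json]}"
PARAMS="${2:-[]}"
curl -s -H 'Content-Type: application/json' \
  --data "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"$METHOD\",\"params\":$PARAMS}" \
  http://localhost:8545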

admin@linux-01.he-eu-hel1.nimbus.prater:/docker/geth-goerli-02 % /docker/geth-goerli-01/rpc.sh admin_peers | jq '.result[].name'
"Geth/v1.10.21-stable/linux-amd64/go1.18.4"
"Geth/v1.10.23-stable-d901d853/linux-amd64/go1.18.5"
"Geth/v1.10.21-stable-67109427/linux-amd64/go1.18.5"
"erigon/v2022.99.99-dev-18f9313c/linux-amd64/go1.19"
"Geth/v1.10.23-stable-d901d853/linux-amd64/go1.18.5"

But that's probably because of low peer numbers.

I don't get it. The nodes are not syncing at all:

admin@linux-01.he-eu-hel1.nimbus.prater:~ % for dir in /docker/geth-goerli-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}

Despite having 50 peers each:

admin@linux-01.he-eu-hel1.nimbus.prater:~ % for dir in /docker/geth-goerli-*; do $dir/rpc.sh admin_peers | jq '.result | length'; done
50
50
50
50

But nothing has been synced:

admin@linux-01.he-eu-hel1.nimbus.prater:~ % sudo du -hsc /docker/geth-goerli-0?/node
3.2M	/docker/geth-goerli-01/node
3.0M	/docker/geth-goerli-02/node
7.2M	/docker/geth-goerli-03/node
3.4M	/docker/geth-goerli-04/node
17M	total

What the fuck is going on...

The startup logs show we are correctly using the Goerli network:

INFO [08-31|08:33:31.510] Starting Geth on Görli testnet... 
...
INFO [08-31|08:33:31.580] Chain ID:  5 (goerli) 
INFO [08-31|08:33:31.580] Consensus: Beacon (proof-of-stake), merged from Clique (proof-of-authority) 
...
INFO [08-31|08:33:31.581] Initialising Ethereum protocol           network=5 dbversion=8
INFO [08-31|08:33:31.587] Loaded most recent local header          number=0 hash=bf7e33..b88c1a td=1 age=3y7mo2w
INFO [08-31|08:33:31.587] Loaded most recent local full block      number=0 hash=bf7e33..b88c1a td=1 age=3y7mo2w
INFO [08-31|08:33:31.587] Loaded most recent local fast block      number=0 hash=bf7e33..b88c1a td=1 age=3y7mo2w
INFO [08-31|08:33:31.589] Loaded local transaction journal         transactions=0 dropped=0
INFO [08-31|08:33:31.589] Regenerated local transaction journal    transactions=0 accounts=0
INFO [08-31|08:33:31.589] Chain post-merge, sync via beacon client 
INFO [08-31|08:33:31.589] Gasprice oracle is ignoring threshold set threshold=2
INFO [08-31|08:33:31.589] Allocated cache and file handles         database=/data/geth/les.server              cache=16.00MiB handles=16
INFO [08-31|08:33:31.593] Configured checkpoint oracle             address=0x18CA0E045F0D772a851BC7e48357Bcaab0a0795D signers=5 threshold=2

I don't get why it's not syncing.

There are a lot of 'Snapshot extension registration failed' messages in the logs:

admin@linux-01.he-eu-hel1.nimbus.prater:~ % zcat /var/log/docker/geth-goerli-02-node/docker.*.gz | grep 'Snapshot extension registration failed' | wc -l
1370

admin@linux-01.he-eu-hel1.nimbus.prater:~ % cat /var/log/docker/geth-goerli-02-node/docker.log | grep 'Snapshot extension registration failed' | wc -l 
346
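(For what it's worth, the rotated and current logs can be counted in one pass:)

# Count matches across rotated (gzipped) and current logs together
{ zcat /var/log/docker/geth-goerli-02-node/docker.*.gz;
  cat /var/log/docker/geth-goerli-02-node/docker.log; } \
  | grep -c 'Snapshot extension registration failed'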

Not sure if that's relevant though.

Oooh, ok, now I see it:

WARN [08-31|08:39:06.628] Post-merge network, but no beacon client seen. Please launch one to follow the chain! 

We NEED a consensus layer node to tell the execution node what the current head of the chain is, so it can start syncing.
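For reference, the consensus client drives Geth over the authenticated Engine API; a minimal sketch of the wiring with upstream defaults (addresses and paths are assumptions):

# Execution side: expose the authenticated Engine API with a shared JWT secret
geth --goerli \
  --authrpc.addr=127.0.0.1 --authrpc.port=8551 \
  --authrpc.jwtsecret=/data/jwt.hex

# Consensus side: point Nimbus at that endpoint with the same secret
nimbus_beacon_node --network=prater \
  --web3-url=http://127.0.0.1:8551 \
  --jwt-secret=/data/jwt.hex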

And now we are finally syncing:

admin@linux-01.he-eu-hel1.nimbus.prater:~ % for dir in /docker/geth-goerli-*; do $dir/rpc.sh eth_syncing | jq -c '.result | { currentBlock, highestBlock }'; done
{"currentBlock":"0x2e488d","highestBlock":"0x727aee"}
{"currentBlock":"0x30f93e","highestBlock":"0x727af1"}
{"currentBlock":"0x25a9b8","highestBlock":"0x727af8"}
{"currentBlock":"0xc2b6f","highestBlock":"0x727b1f"}

Looks like some of the Geth nodes have fucked up ancient data for some reason:

INFO [08-31|15:08:52.349] Allocated cache and file handles         database=/data/geth/chaindata cache=15.71GiB handles=524,288
INFO [08-31|15:08:52.953] Opened ancient database                  database=/data/geth/chaindata/ancient/chain readonly=false
Fatal: Failed to register the Ethereum service: ancient chain segments already extracted, please set --datadir.ancient to the correct path
Fatal: Failed to register the Ethereum service: ancient chain segments already extracted, please set --datadir.ancient to the correct path

A related issue suggests removing chaindata/ancient, but that didn't help; I had to remove all of chaindata to get the nodes to start syncing again:

admin@linux-01.he-eu-hel1.nimbus.prater:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":{"currentBlock":"0x4a26f5","healedBytecodeBytes":"0x0","healedBytecodes":"0x0","healedTrienodeBytes":"0x0","healedTrienodes":"0x0","healingBytecode":"0x0","healingTrienodes":"0x0","highestBlock":"0x728011","startingBlock":"0x0","syncedAccountBytes":"0x625e0c71","syncedAccounts":"0x61c1f9","syncedBytecodeBytes":"0x51f24ccd","syncedBytecodes":"0x2c626","syncedStorage":"0x3ffa08d","syncedStorageBytes":"0x3623d0241"}}
{"jsonrpc":"2.0","id":1,"result":{"currentBlock":"0x376d9c","healedBytecodeBytes":"0x0","healedBytecodes":"0x0","healedTrienodeBytes":"0x0","healedTrienodes":"0x0","healingBytecode":"0x0","healingTrienodes":"0x0","highestBlock":"0x72801a","startingBlock":"0x0","syncedAccountBytes":"0x53392659","syncedAccounts":"0x512202","syncedBytecodeBytes":"0x47d7abc6","syncedBytecodes":"0x272c1","syncedStorage":"0x37a6c54","syncedStorageBytes":"0x2efa02ecb"}}
{"jsonrpc":"2.0","id":1,"result":{"currentBlock":"0x44fab1","healedBytecodeBytes":"0x0","healedBytecodes":"0x0","healedTrienodeBytes":"0x0","healedTrienodes":"0x0","healingBytecode":"0x0","healingTrienodes":"0x0","highestBlock":"0x728050","startingBlock":"0x44fab1","syncedAccountBytes":"0x54e431b6","syncedAccounts":"0x561382","syncedBytecodeBytes":"0x48d78cb8","syncedBytecodes":"0x279a7","syncedStorage":"0x391d59b","syncedStorageBytes":"0x304293e9d"}}
{"jsonrpc":"2.0","id":1,"result":false}

I have stopped the nodes on the old metal hosts and migrated the validators.

  • b3ba3211 - nimbus.prater: drop old metal linux hosts
  • 87663366 - nimbus.prater: deploy Geth nodes on new hosts

We're already seeing attestations and proposals:

[image: validator attestations and proposals]

And we are seeing Nimbus hosts proposing: https://prater.beaconcha.in/blocks?q=Nimbus%2Fv

[image: Nimbus block proposals on prater.beaconcha.in]

Tomorrow I will reuse 3 of the 6 leftover old Prater hosts to run Geth nodes for the AWS/macOS/Windows hosts.

The remaining 3 hosts will be used for CI or decommissioned.

I configured a dedicated set of Geth nodes for Windows:

  • 0d7e29b8 - add geth-windows-01.he-eu-hel1.nimbus.prater host
  • cb448d64 - nimbus-prater-windows: deploy dedicated Geth nodes

windows-goerli-01.he-eu-hel1.nimbus.geth hostname=windows-goerli-01.he-eu-hel1.nimbus.geth ansible_host=65.21.196.47 env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1 dns_entry=windows-goerli-01.he-eu-hel1.nimbus.geth.statusim.net

The sync progress is good; the nodes might start working tomorrow:

[image: Geth sync progress]

We finished syncing:

[image: Geth sync complete]

admin@windows-goerli-01.he-eu-hel1.nimbus.geth:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
admin@windows-01 MINGW64 ~                                                                                                                        
$ for port in $(seq 9300 9302); do curl -sS "localhost:$port/eth/v1/node/syncing" | jq -c; done
{"data":{"head_slot":"3835538","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3835538","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3835538","sync_distance":"0","is_syncing":false,"is_optimistic":false}}

So the Windows host is done.

Deployed a host for macOS Prater nodes:

  • 2dd9350f - add macos-goerli-01.he-eu-hel1.nimbus.geth host

macos-goerli-01.he-eu-hel1.nimbus.geth hostname=macos-goerli-01.he-eu-hel1.nimbus.geth ansible_host=65.21.196.48 env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1 dns_entry=macos-goerli-01.he-eu-hel1.nimbus.geth.statusim.net

Decided to rename the hosts while adding a third one, to simplify setup:

  • 08a744da - rename Goerli geth nodes to be part of one fleet

goerli-01.he-eu-hel1.nimbus.geth hostname=goerli-01.he-eu-hel1.nimbus.geth ansible_host=65.21.73.183 env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1 dns_entry=goerli-01.he-eu-hel1.nimbus.geth.statusim.net
goerli-02.he-eu-hel1.nimbus.geth hostname=goerli-02.he-eu-hel1.nimbus.geth ansible_host=65.21.196.48 env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1 dns_entry=goerli-02.he-eu-hel1.nimbus.geth.statusim.net
goerli-03.he-eu-hel1.nimbus.geth hostname=goerli-03.he-eu-hel1.nimbus.geth ansible_host=65.21.196.47 env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1 dns_entry=goerli-03.he-eu-hel1.nimbus.geth.statusim.net

Configured the existing AWS and macOS nodes to use the new Goerli Geth nodes (see the sketch after the commit below):

  • e80d5943 - nimbus.prater: use EL clients from new Geth hosts
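A minimal sketch of what that change presumably amounts to per beacon node (hostname from the inventory above; the Engine API port and JWT path are assumptions):

# Point an existing beacon node at one of the new shared Geth hosts
nimbus_beacon_node --network=prater \
  --web3-url=http://goerli-01.he-eu-hel1.nimbus.geth.statusim.net:8551 \
  --jwt-secret=/data/jwt.hex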

Currently syncing:

[image: beacon node sync progress]

I think this is done:

 > a nimbus.prater -a 'for port in $(seq 9300 9305); do curl -s 0:$port/eth/v1/node/syncing | jq -c; done' 
stable-large-01.aws-eu-central-1a.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3836702","sync_distance":"6822","is_syncing":true,"is_optimistic":true}}
unstable-large-01.aws-eu-central-1a.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3839794","sync_distance":"3730","is_syncing":true,"is_optimistic":true}}
testing-large-01.aws-eu-central-1a.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843523","sync_distance":"1","is_syncing":false,"is_optimistic":true}}
macos-01.ms-eu-dublin.nimbus.prater | FAILED | rc=127 >>
{"data":{"head_slot":"3843847","sync_distance":"0","is_syncing":false,"is_optimistic":true}}
{"data":{"head_slot":"3843838","sync_distance":"9","is_syncing":false,"is_optimistic":true}}
{"data":{"head_slot":"3843836","sync_distance":"11","is_syncing":false,"is_optimistic":true}}
windows-01.he-eu-hel1.nimbus.prater | FAILED! => {
{"data":{"head_slot":"3836646","sync_distance":"6880","is_syncing":true,"is_optimistic":true}}                                                    
{"data":{"head_slot":"3836671","sync_distance":"6855","is_syncing":true,"is_optimistic":true}}                                                    
{"data":{"head_slot":"3836785","sync_distance":"6741","is_syncing":true,"is_optimistic":true}}
linux-01.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
linux-02.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
linux-03.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843523","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843523","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843523","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843523","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
linux-04.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
linux-05.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
linux-06.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3843524","sync_distance":"0","is_syncing":false,"is_optimistic":false}}

All are using the new Geth nodes. We can decommission the old AWS one.

Got rid of the old AWS Geth Goerli node:

  • 94816223 - drop goerli-01.aws-eu-central-1a.nimbus.geth host

I consider this done.