
Deploy dedicated Geth instances for all Prater Nimbus nodes

jakubgs opened this issue · 23 comments

This has been neglected due to other priorities, like Mainnet nodes, but it's time to do a proper setup of one Geth node for each Nimbus node, as God intended. This will require quite a lot of hardware, as the Prater fleet involves 11 hosts and 33 nodes in total.

The current size of a snap-synced Geth node is about 160 GB:

% sudo du -hs /data/nimbus-goerli/node/data
161G	/data/nimbus-goerli/node/data

We'll need at least 200 GB per node, and about 4 nodes on each host. So 1 TB NVMe should be sufficient for a while.
The most likely candidate is a Hetzner AX51-NVMe host with 2 x 1 TB NVMe drives.
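
As a quick sanity check of the sizing above, here is a minimal sketch of the arithmetic. The numbers come from this thread; treating 1 TB as 1000 GB is my assumption, and the function name is purely illustrative.

```python
# Figures from the discussion above. Decimal units (1 TB = 1000 GB) are an assumption.
NODE_BUDGET_GB = 200   # planned allowance per Geth node
CURRENT_NODE_GB = 161  # measured size of a snap-synced node (du output)
NODES_PER_HOST = 4
DISK_GB = 1000         # one 1 TB NVMe


def headroom_gb(disk_gb: int, per_node_gb: int, nodes: int) -> int:
    """Space left on the disk after reserving per_node_gb for each node."""
    return disk_gb - per_node_gb * nodes


if __name__ == "__main__":
    print("headroom at planned budget:", headroom_gb(DISK_GB, NODE_BUDGET_GB, NODES_PER_HOST), "GB")
    print("headroom at current size:  ", headroom_gb(DISK_GB, CURRENT_NODE_GB, NODES_PER_HOST), "GB")
```

With four nodes at the 200 GB budget, a 1 TB drive still has some room left, which is why it should hold for a while but not indefinitely as the chain grows.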

Another possible option would be to replace existing hosts with bigger ones instead of adding separate hosts for Geth.

We could use Hetzner AX61-NVMe hosts, which have 2 x 1.92 TB NVMe drives. That would be enough to run both Geth and Nimbus nodes on the same host, which would simplify setup, management, and debugging.

Based on a conversation with @zah, I'm purchasing six AX61-NVMe hosts.


After the migration, the leftover AX41-NVMe hosts will be reused for macOS and Windows Geth nodes, as well as CI.

I've provisioned the hosts: 1bdcf1ca

linux-01.he-eu-hel1.nimbus.prater hostname=linux-01.he-eu-hel1.nimbus.prater ansible_host= env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1
linux-02.he-eu-hel1.nimbus.prater hostname=linux-02.he-eu-hel1.nimbus.prater ansible_host= env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1
linux-03.he-eu-hel1.nimbus.prater hostname=linux-03.he-eu-hel1.nimbus.prater ansible_host= env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1
linux-04.he-eu-hel1.nimbus.prater hostname=linux-04.he-eu-hel1.nimbus.prater ansible_host= env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1
linux-05.he-eu-hel1.nimbus.prater hostname=linux-05.he-eu-hel1.nimbus.prater ansible_host= env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1
linux-06.he-eu-hel1.nimbus.prater hostname=linux-06.he-eu-hel1.nimbus.prater ansible_host= env=nimbus stage=prater data_center=he-eu-hel1 region=eu-hel1

And started deploying Geth nodes on them.

So far they have not started syncing:

admin@linux-01.he-eu-hel1.nimbus.prater:/docker/geth-goerli-02 % d 
CONTAINER ID   NAMES                 IMAGE                         CREATED          STATUS
1ac7ddb275f1   geth-goerli-04-node   ethereum/client-go:v1.10.23   15 minutes ago   Up 15 minutes
28c7a83695b7   geth-goerli-03-node   ethereum/client-go:v1.10.23   16 minutes ago   Up 16 minutes
2f2aa5925bfb   geth-goerli-02-node   ethereum/client-go:v1.10.23   18 minutes ago   Up 18 minutes
5970e7efbbab   geth-goerli-01-node   ethereum/client-go:v1.10.23   19 minutes ago   Up 19 minutes

admin@linux-01.he-eu-hel1.nimbus.prater:/docker/geth-goerli-02 % /docker/geth-goerli-02/ eth_syncing
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": false
}

admin@linux-01.he-eu-hel1.nimbus.prater:/docker/geth-goerli-02 % /docker/geth-goerli-01/ admin_peers | jq '.result[].name'

But that's probably because of low peer numbers.
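
For anyone who wants to reproduce these calls without the helper script used above, here is a rough Python sketch of a JSON-RPC call. The function names are hypothetical, and the URL in the usage comment is only Geth's default HTTP port, not necessarily what this deployment uses. Note that `admin_peers` only works if the `admin` namespace is exposed on that endpoint.

```python
import json
import urllib.request
from typing import Optional


def rpc_payload(method: str, params: Optional[list] = None, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 request body."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params or []}


def rpc_call(url: str, method: str, params: Optional[list] = None, timeout: float = 5.0) -> dict:
    """POST a JSON-RPC request to `url` and return the decoded response."""
    body = json.dumps(rpc_payload(method, params)).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (assumed endpoint, adjust to the actual container port):
#   rpc_call("http://localhost:8545", "eth_syncing")
#   len(rpc_call("http://localhost:8545", "admin_peers")["result"])
```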

I don't get it. The nodes are not syncing at all:

admin@linux-01.he-eu-hel1.nimbus.prater:~ % for dir in /docker/geth-goerli-*; do $dir/ eth_syncing | jq -c; done

Despite having 50 peers each:

admin@linux-01.he-eu-hel1.nimbus.prater:~ % for dir in /docker/geth-goerli-*; do $dir/ admin_peers | jq '.result | length'; done

But nothing has been synced:

admin@linux-01.he-eu-hel1.nimbus.prater:~ % sudo du -hsc /docker/geth-goerli-0?/node
3.2M	/docker/geth-goerli-01/node
3.0M	/docker/geth-goerli-02/node
7.2M	/docker/geth-goerli-03/node
3.4M	/docker/geth-goerli-04/node
17M	total

What the fuck is going on...

The startup logs show we are correctly using the Goerli network:

INFO [08-31|08:33:31.510] Starting Geth on Görli testnet... 
INFO [08-31|08:33:31.580] Chain ID:  5 (goerli) 
INFO [08-31|08:33:31.580] Consensus: Beacon (proof-of-stake), merged from Clique (proof-of-authority) 
INFO [08-31|08:33:31.581] Initialising Ethereum protocol           network=5 dbversion=8
INFO [08-31|08:33:31.587] Loaded most recent local header          number=0 hash=bf7e33..b88c1a td=1 age=3y7mo2w
INFO [08-31|08:33:31.587] Loaded most recent local full block      number=0 hash=bf7e33..b88c1a td=1 age=3y7mo2w
INFO [08-31|08:33:31.587] Loaded most recent local fast block      number=0 hash=bf7e33..b88c1a td=1 age=3y7mo2w
INFO [08-31|08:33:31.589] Loaded local transaction journal         transactions=0 dropped=0
INFO [08-31|08:33:31.589] Regenerated local transaction journal    transactions=0 accounts=0
INFO [08-31|08:33:31.589] Chain post-merge, sync via beacon client 
INFO [08-31|08:33:31.589] Gasprice oracle is ignoring threshold set threshold=2
INFO [08-31|08:33:31.589] Allocated cache and file handles         database=/data/geth/les.server              cache=16.00MiB handles=16
INFO [08-31|08:33:31.593] Configured checkpoint oracle             address=0x18CA0E045F0D772a851BC7e48357Bcaab0a0795D signers=5 threshold=2

I don't get why it's not syncing.

There are a lot of "Snapshot extension registration failed" messages in the logs:

admin@linux-01.he-eu-hel1.nimbus.prater:~ % zcat /var/log/docker/geth-goerli-02-node/docker.*.gz | grep 'Snapshot extension registration failed' | wc -l

admin@linux-01.he-eu-hel1.nimbus.prater:~ % cat /var/log/docker/geth-goerli-02-node/docker.log | grep 'Snapshot extension registration failed' | wc -l 

Not sure if that's relevant though.
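
The same count can be done in one pass over both the rotated and the current log. This is a small sketch, assuming the directory layout from the commands above (`docker.log` plus gzipped `docker.*.gz` files); the function name is made up.

```python
import gzip
from pathlib import Path


def count_matches(log_dir: str, needle: str) -> dict:
    """Count lines containing `needle` in every docker.* log file, gzipped or plain."""
    counts = {}
    for path in sorted(Path(log_dir).glob("docker.*")):
        if not path.is_file():
            continue
        # Rotated logs are gzip-compressed, the current one is plain text.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", errors="replace") as handle:
            counts[path.name] = sum(1 for line in handle if needle in line)
    return counts


# Usage:
#   count_matches("/var/log/docker/geth-goerli-02-node",
#                 "Snapshot extension registration failed")
```

Reporting per file rather than one total makes it easier to see whether the message is still being logged or only shows up in old, rotated logs.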

Oooh, ok, now I see it:

WARN [08-31|08:39:06.628] Post-merge network, but no beacon client seen. Please launch one to follow the chain! 

We NEED a consensus layer node to learn what the current head of the blockchain is, so we can start syncing the execution node.
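
This warning was easy to miss among the INFO lines. A rough sketch of a filter that would have surfaced it sooner is below. It assumes the log line layout seen in this thread (`LEVEL [timestamp] message key=value ...`); the function name and the regexes are mine, not anything Geth ships.

```python
import re
from collections import Counter

# Matches lines such as: WARN [08-31|08:39:06.628] Some message   key=value
LINE_RE = re.compile(r"^(?P<level>[A-Z]+)\s*\[(?P<ts>[^\]]+)\]\s+(?P<rest>.*)$")
# The first "key=" token marks the end of the human-readable message.
KV_RE = re.compile(r"\s+[A-Za-z_][\w.]*=")
INTERESTING_LEVELS = {"WARN", "ERROR", "CRIT"}


def warnings_summary(lines) -> Counter:
    """Count distinct WARN/ERROR/CRIT messages, ignoring their key=value context."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("level") not in INTERESTING_LEVELS:
            continue
        message = KV_RE.split(match.group("rest"), maxsplit=1)[0].strip()
        counts[message] += 1
    return counts


# Usage:
#   with open("/var/log/docker/geth-goerli-02-node/docker.log") as log:
#       for message, n in warnings_summary(log).most_common(10):
#           print(n, message)
```

Lines that do not follow the level/timestamp layout, such as the `Fatal:` lines Geth prints on exit, are skipped by this sketch and would need separate handling.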

And now we are finally syncing:

admin@linux-01.he-eu-hel1.nimbus.prater:~ % for dir in /docker/geth-goerli-*; do $dir/ eth_syncing | jq -c '.result | { currentBlock, highestBlock }'; done
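
For reading those responses: `eth_syncing` returns either `false` or an object with hex-encoded block numbers. As seen earlier in this thread, `false` is ambiguous, since a node that has not started syncing reports it too. A small sketch of turning a response into something readable follows; the function name is hypothetical.

```python
def sync_progress(response: dict) -> dict:
    """Summarise an eth_syncing JSON-RPC response.

    `result` is False when the node is not syncing (either fully synced or
    not started yet), otherwise an object with hex-encoded block numbers.
    """
    result = response["result"]
    if result is False:
        return {"syncing": False}
    current = int(result["currentBlock"], 16)
    highest = int(result["highestBlock"], 16)
    return {
        "syncing": True,
        "current": current,
        "highest": highest,
        "remaining": max(highest - current, 0),
    }


# Usage with a response shaped like the one shown earlier:
#   sync_progress({"jsonrpc": "2.0", "id": 1, "result": False})
```

Because of the ambiguity, it is worth checking the `false` case against disk usage or the latest block number before calling a node synced.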

Looks like for some reason some Geth nodes have fucked up ancient data:

INFO [08-31|15:08:52.349] Allocated cache and file handles         database=/data/geth/chaindata cache=15.71GiB handles=524,288
INFO [08-31|15:08:52.953] Opened ancient database                  database=/data/geth/chaindata/ancient/chain readonly=false
Fatal: Failed to register the Ethereum service: ancient chain segments already extracted, please set --datadir.ancient to the correct path
Fatal: Failed to register the Ethereum service: ancient chain segments already extracted, please set --datadir.ancient to the correct path

This issue suggests removing chaindata/ancient:

But that didn't help, and I had to remove all of chaindata to get the nodes to start syncing again:

admin@linux-01.he-eu-hel1.nimbus.prater:~ % for dir in /docker/geth-*; do $dir/ eth_syncing | jq -c; done

I have stopped the nodes on the old metal hosts and migrated the validators.

  • b3ba3211 - nimbus.prater: drop old metal linux hosts
  • 87663366 - nimbus.prater: deploy Geth nodes on new hosts

We're already seeing attestations and proposals:


And we are seeing Nimbus hosts proposing:


Tomorrow I will reuse 3 of the 6 leftover old Prater hosts to run Geth nodes for the AWS/macOS/Windows hosts.

The remaining 3 hosts will be used for CI or decommissioned.

I configured a dedicated set of Geth nodes for Windows:

  • 0d7e29b8 - add geth-windows-01.he-eu-hel1.nimbus.prater host
  • cb448d64 - nimbus-prater-windows: deploy dedicated Geth nodes

windows-goerli-01.he-eu-hel1.nimbus.geth hostname=windows-goerli-01.he-eu-hel1.nimbus.geth ansible_host= env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1

The sync progress is good; the nodes might start working tomorrow:


We finished syncing:


admin@windows-goerli-01.he-eu-hel1.nimbus.geth:~ % for dir in /docker/geth-*; do $dir/ eth_syncing | jq -c; done

admin@windows-01 MINGW64 ~
$ for port in $(seq 9300 9302); do curl -sS "localhost:$port/eth/v1/node/syncing" | jq -c; done

So the Windows host is done.

Deployed a host for macOS Prater nodes:

  • 2dd9350f - add macos-goerli-01.he-eu-hel1.nimbus.geth host

macos-goerli-01.he-eu-hel1.nimbus.geth hostname=macos-goerli-01.he-eu-hel1.nimbus.geth ansible_host= env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1

Decided to rename the hosts while adding a third one so as to simplify setup:

  • 08a744da - rename Goerli geth nodes to be part of one fleet

goerli-01.he-eu-hel1.nimbus.geth hostname=goerli-01.he-eu-hel1.nimbus.geth ansible_host= env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1
goerli-02.he-eu-hel1.nimbus.geth hostname=goerli-02.he-eu-hel1.nimbus.geth ansible_host= env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1
goerli-03.he-eu-hel1.nimbus.geth hostname=goerli-03.he-eu-hel1.nimbus.geth ansible_host= env=nimbus stage=geth data_center=he-eu-hel1 region=eu-hel1

Configured existing AWS and macOS nodes to use the new Goerli Geth nodes:

  • e80d5943 - nimbus.prater: use EL clients from new Geth hosts

Currently syncing:


I think this is done:

 > a nimbus.prater -a 'for port in $(seq 9300 9305); do curl -s 0:$port/eth/v1/node/syncing | jq -c; done'
 | CHANGED | rc=0 >>
{"data":{"head_slot":"3836702","sync_distance":"6822","is_syncing":true,"is_optimistic":true}}
 | CHANGED | rc=0 >>
{"data":{"head_slot":"3839794","sync_distance":"3730","is_syncing":true,"is_optimistic":true}}
 | CHANGED | rc=0 >>
{"data":{"head_slot":"3843523","sync_distance":"1","is_syncing":false,"is_optimistic":true}}
 | FAILED | rc=127 >>
windows-01.he-eu-hel1.nimbus.prater | FAILED! => {
linux-01.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
linux-02.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
linux-03.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
linux-04.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
linux-05.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
linux-06.he-eu-hel1.nimbus.prater | CHANGED | rc=0 >>
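
To make output like this easier to scan, here is a small sketch that classifies each `/eth/v1/node/syncing` response line. The labels and the function name are my own. As far as I understand the beacon API, `is_optimistic` means the node is following a head whose execution payloads have not been fully verified yet, so I treat that as a separate state from fully synced.

```python
import json


def classify(line: str) -> str:
    """Turn one /eth/v1/node/syncing JSON response into a short status label."""
    data = json.loads(line)["data"]
    if data["is_syncing"]:
        return f"syncing, {data['sync_distance']} slots behind"
    if data.get("is_optimistic"):
        return "at head, optimistic (execution payloads not yet verified)"
    return "synced"


if __name__ == "__main__":
    sample = '{"data":{"head_slot":"3843523","sync_distance":"1","is_syncing":false,"is_optimistic":true}}'
    print(classify(sample))
```

A node that stays optimistic for a long time usually points at its execution client still catching up, so that state is worth watching rather than treating as done.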

All are using the new Geth nodes. We can decommission the old AWS one.

Got rid of the old AWS Geth Goerli node:

  • 94816223 - drop host

I consider this done.