nimiq/core-rs-albatross

Full node cannot achieve consensus

Closed this issue · 3 comments

I have set up a new full node on bare metal (12 x 5.1 GHz, 32 GB RAM and 1 TB NVMe), but this node does not seem able to reach consensus.

You will see in the logs that I restarted it several times over the course of 9 hours, at one point even deleting the db.

This is not the first time I have experienced this issue.

Log: https://www.transfernow.net/dl/20241002PGuWWsd9 (available for 7 days)

Extracting interesting data from the log:

  1. State sync never completes:
2024-10-02T13:24:52.277485003Z INFO  state_queue          | Received state sync chunk, ~99.99% complete start_key=fff8d7
  2. After the client was restarted, state sync reached 14.74%, preceded by rate limit errors:
2024-10-02T14:55:05.823080201Z DEBUG diff_request_compon… | couldn't fetch diff: Inbound error: Request exceeds the maximum rate limit peer_id=12D3KooWPwV3T3fwkavenKnWxT9e6wCvoQchok8X2YSRTB5WiLng block=#5708370:MA:53611d5dee num_tries=1 max_tries=14 error=InboundRequest(ExceedsRateLimit)
...
2024-10-02T14:55:24.105679088Z INFO  state_queue          | Received state sync chunk, ~14.74% complete start_key=25bb3f

The plot goes a little deeper. Please find attached the same log with a few extra hours of logging.

The final state seems to be 'couldn't fetch diff: no peers'

Maybe two separate issues?

Log: https://www.transfernow.net/dl/20241002DE3om12U (available for 7 days)

We had a bug where peers (active validators) were being miscategorized as not synced. As a result, we ignored their blocks from gossipsub, which created gaps during the syncing process. Every gap generated a missing blocks request spanning from the head of our node to the block announced over gossipsub, so too many overlapping requests accumulated across multiple batches, leading to this issue.

#2982 fixed the miscategorization of the peers, and #3008 reduces the number of possible gaps induced by a gossipsub-targeted attack.
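
To illustrate why the overlapping requests described in this explanation add up so quickly, here is a minimal sketch. It is not the actual core-rs-albatross code: the type `MissingBlocksRequest` and the functions `requests_for_announcements`, `total_requested` and `unique_requested` are hypothetical names, and the model assumes the head does not advance while announcements keep arriving.

```rust
use std::collections::BTreeSet;

/// Hypothetical model of a missing-blocks request: it asks for every
/// height strictly between `head` and `target`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MissingBlocksRequest {
    head: u32,
    target: u32,
}

/// Assumption: each announced block that leaves a gap above the head
/// triggers one request spanning from the head to that block.
fn requests_for_announcements(head: u32, announced: &[u32]) -> Vec<MissingBlocksRequest> {
    announced
        .iter()
        .filter(|&&height| height > head + 1) // height == head + 1 leaves no gap
        .map(|&height| MissingBlocksRequest { head, target: height })
        .collect()
}

/// Total number of block heights requested, counting duplicates.
fn total_requested(requests: &[MissingBlocksRequest]) -> u32 {
    requests.iter().map(|r| r.target - r.head - 1).sum()
}

/// Number of distinct block heights actually missing.
fn unique_requested(requests: &[MissingBlocksRequest]) -> usize {
    let mut heights = BTreeSet::new();
    for r in requests {
        heights.extend(r.head + 1..r.target);
    }
    heights.len()
}

fn main() {
    // The head is stuck at height 100 while three later blocks are announced.
    let requests = requests_for_announcements(100, &[105, 109, 112]);
    println!(
        "{} requests ask for {} blocks, but only {} are distinct",
        requests.len(),
        total_requested(&requests),
        unique_requested(&requests)
    );
}
```

Because every request in this model starts at the same head, each new gap asks again for all blocks covered by the earlier requests, so the redundant load grows with every announcement.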