likecoin/likecoin-chain

Node crashes every 1000 blocks with consensus failure error.

mgl2150 opened this issue · 11 comments

Dec 08 03:02:26 cyber liked[867908]: 3:02AM ERR CONSENSUS FAILURE!!! err="unable to delete version 1568000 with 1 active readers" module=consensus stack="goroutine 688 [running]:\nruntime/debug.Stack()\n\t/usr/lib/go/src/runtime/debug/stack.go:24 +0x65\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine.func2()\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:726 +0x4c\npanic({0x171ad40, 0xc0137922e8})\n\t/usr/lib/go/src/runtime/panic.go:1038 +0x215\ngithub.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).pruneStores(0xc000510320)\n\t/home/cyber/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.9/store/rootmulti/store.go:388 +0x1b6\ngithub.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Commit(0xc000510320)\n\t/home/cyber/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.9/store/rootmulti/store.go:362 +0x175\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).Commit(0xc0013ffa00)\n\t/home/cyber/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.9/baseapp/abci.go:305 +0x22e\ngithub.com/tendermint/tendermint/abci/client.(*localClient).CommitSync(0xc001039680)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/abci/client/local_client.go:258 +0xb6\ngithub.com/tendermint/tendermint/proxy.(*appConnConsensus).CommitSync(0x0)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/proxy/app_conn.go:93 +0x22\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).Commit(_, {{{0xb, 0x0}, {0xc009c57c59, 0x7}}, {0xc009c5dd58, 0x12}, 0x1, 0x17ed6e, {{0xc029b64be0, ...}, ...}, ...}, ...)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/state/execution.go:228 +0x269\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(_, {{{0xb, 0x0}, {0xc009c57c59, 0x7}}, {0xc009c5dd58, 0x12}, 0x1, 0x17ed6e, {{0xc029b64be0, ...}, ...}, ...}, ...)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/state/execution.go:180 +0x6ee\ngithub.com/tendermint/tendermint/consensus.(*State).finalizeCommit(0xc0000e6a80, 0x17ed6e)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:1635 +0x9fd\ngithub.com/tendermint/tendermint/consensus.(*State).tryFinalizeCommit(0xc0000e6a80, 0x17ed6e)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:1546 +0x305\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit.func1()\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:1481 +0x87\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit(0xc0000e6a80, 0x17ed6e, 0x0)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:1519 +0xc06\ngithub.com/tendermint/tendermint/consensus.(*State).addVote(0xc0000e6a80, 0xc01dac6500, {0xc00a7282a0, 0x28})\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:2132 +0xb6e\ngithub.com/tendermint/tendermint/consensus.(*State).tryAddVote(0xc0000e6a80, 0xc01dac6500, {0xc00a7282a0, 0xc00ddc8ea0})\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:1930 +0x2c\ngithub.com/tendermint/tendermint/consensus.(*State).handleMsg(0xc0000e6a80, {{0x1d0e220, 0xc010a8ba08}, {0xc00a7282a0, 0x0}})\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:838 +0x3dc\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine(0xc0000e6a80, 0x0)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:762 +0x419\ncreated by 
github.com/tendermint/tendermint/consensus.(*State).OnStart\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:378 +0x12f\n"

The node then runs normally after we restart it.

Probably your pruning setting and statesync setting don't match, so pruning is trying to delete an active statesync snapshot.
Can you post your pruning and statesync snapshot config?
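
For reference, a consistent pairing in app.toml would look roughly like this (key names as in Cosmos SDK v0.42; the values are only illustrative):

# Pruning settings (top level of app.toml)
pruning = "custom"
pruning-keep-recent = "100"
pruning-keep-every = "500"   # snapshot-interval must be a multiple of this
pruning-interval = "10"

[state-sync]
# With snapshot heights landing on kept-forever heights, pruning should never
# delete a version that a snapshot exporter is still reading.
snapshot-interval = 500
snapshot-keep-recent = 2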

Thank you so much, I'll take a look at it. I've also seen this error while running the node; it too is solved by restarting the node:

Dec 09 05:36:52 cyber liked[905686]: 5:36AM INF Dialing peer address={"id":"e44a2165ac573f84151671b092aa4936ac305e2a","ip":"123.202.22.25","port":26656} module=p2p
Dec 09 05:36:53 cyber liked[905686]: panic: runtime error: invalid memory address or nil pointer dereference
Dec 09 05:36:53 cyber liked[905686]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xcec533]
Dec 09 05:36:53 cyber liked[905686]: goroutine 4370601 [running]:
Dec 09 05:36:53 cyber liked[905686]: github.com/cosmos/iavl.(*nodeDB).incrVersionReaders(0x1d48be8, 0x20)
Dec 09 05:36:53 cyber liked[905686]:         /home/cyber/go/pkg/mod/github.com/cosmos/iavl@v0.16.0/nodedb.go:619 +0x33
Dec 09 05:36:53 cyber liked[905686]: github.com/cosmos/iavl.newExporter(0xc02c6cd2a8)
Dec 09 05:36:53 cyber liked[905686]:         /home/cyber/go/pkg/mod/github.com/cosmos/iavl@v0.16.0/export.go:45 +0xc8
Dec 09 05:36:53 cyber liked[905686]: github.com/cosmos/iavl.(*ImmutableTree).Export(...)
Dec 09 05:36:53 cyber liked[905686]:         /home/cyber/go/pkg/mod/github.com/cosmos/iavl@v0.16.0/immutable_tree.go:142
Dec 09 05:36:53 cyber liked[905686]: github.com/cosmos/cosmos-sdk/store/iavl.(*Store).Export(0xc02e06ad80, 0x1d1f2d0)
Dec 09 05:36:53 cyber liked[905686]:         /home/cyber/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.9/store/iavl/store.go:239 +0x52
Dec 09 05:36:53 cyber liked[905686]: github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Snapshot.func2()
Dec 09 05:36:53 cyber liked[905686]:         /home/cyber/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.9/store/rootmulti/store.go:660 +0x4b8
Dec 09 05:36:53 cyber liked[905686]: created by github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Snapshot
Dec 09 05:36:53 cyber liked[905686]:         /home/cyber/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.9/store/rootmulti/store.go:627 +0x53b

Seeing that the error stack originates from Snapshot, I think it's probably something to do with pruning and statesync, maybe the other way round: statesync is snapshotting a pruned datastore, causing the nil pointer error.

Well, we're using a systemd unit. These are the environment variables for the pruning settings:

Environment=LIKED_PRUNING=custom
Environment=LIKED_PRUNING_KEEP_RECENT=100
Environment=LIKED_PRUNING_KEEP_EVERY=0
Environment=LIKED_PRUNING_INTERVAL=10
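
If I understand the Cosmos SDK env var binding correctly, these should be equivalent to the following app.toml pruning keys (listed here just for reference):

pruning = "custom"
pruning-keep-recent = "100"
pruning-keep-every = "0"
pruning-interval = "10"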

We use the same pruning settings on other chains and none of them have such errors.

Can you attach your statesync snapshot setting as well? Thanks.

This one's in the config.toml file:

[statesync]
# State sync rapidly bootstraps a new node by discovering, fetching, and restoring a state machine
# snapshot from peers instead of fetching and replaying historical blocks. Requires some peers in
# the network to take and serve state machine snapshots. State sync is not attempted if the node
# has any local state (LastBlockHeight > 0). The node will have a truncated block history,
# starting from the height of the snapshot.
enable = false

# RPC servers (comma-separated) for light client verification of the synced state machine and
# retrieval of state data for node bootstrapping. Also needs a trusted height and corresponding
# header hash obtained from a trusted source, and a period during which validators can be trusted.
#
# For Cosmos SDK-based chains, trust_period should usually be about 2/3 of the unbonding time (~2
# weeks) during which they can be financially punished (slashed) for misbehavior.
rpc_servers = ""
trust_height = 0
trust_hash = ""
trust_period = "168h0m0s"

# Time to spend discovering snapshots before initiating a restore.
discovery_time = "15s"

# Temporary directory for state sync snapshot chunks, defaults to the OS tempdir (typically /tmp).
# Will create a new, randomly named directory within, and remove it when done.
temp_dir = ""

# The timeout duration before re-requesting a chunk, possibly from a different
# peer (default: 1 minute).
chunk_request_timeout = "10s"

# The number of concurrent chunk fetchers to run (default: 1).
chunk_fetchers = "4"

Can you include the one in app.toml, which is the snapshot setting? The one in config.toml is the client setting.

oh sorry

[state-sync]

# snapshot-interval specifies the block interval at which local state sync snapshots are
# taken (0 to disable). Must be a multiple of pruning-keep-every.
snapshot-interval = 0

# snapshot-keep-recent specifies the number of recent snapshots to keep and serve (0 to keep all).
snapshot-keep-recent = 2

The config seems fine, but I think the statesync snapshot must somehow have been active for these errors to occur.
Could some env var be overriding the config?
We'll use the same config and try to reproduce this later on our side.
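
If there is an override, it should look like the pruning ones above, e.g. something like the following in the systemd unit (the value 1000 here is purely hypothetical, but it would line up with the node crashing every 1000 blocks and with the pruned version 1568000 in the error being a multiple of 1000):

# Hypothetical override: with the SDK's env binding, state-sync.snapshot-interval
# would be read from an env var like this, re-enabling snapshots despite app.toml.
Environment=LIKED_STATE_SYNC_SNAPSHOT_INTERVAL=1000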

Please reopen if this happens again.