likecoin/likecoin-chain

Node crashes every 1000 blocks with consensus failure error.

mgl2150 opened this issue · 11 comments

Dec 08 03:02:26 cyber liked[867908]: 3:02AM ERR CONSENSUS FAILURE!!! err="unable to delete version 1568000 with 1 active readers" module=consensus stack="goroutine 688 [running]:\nruntime/debug.Stack()\n\t/usr/lib/go/src/runtime/debug/stack.go:24 +0x65\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine.func2()\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:726 +0x4c\npanic({0x171ad40, 0xc0137922e8})\n\t/usr/lib/go/src/runtime/panic.go:1038 +0x215\ngithub.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).pruneStores(0xc000510320)\n\t/home/cyber/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.9/store/rootmulti/store.go:388 +0x1b6\ngithub.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Commit(0xc000510320)\n\t/home/cyber/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.9/store/rootmulti/store.go:362 +0x175\ngithub.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).Commit(0xc0013ffa00)\n\t/home/cyber/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.9/baseapp/abci.go:305 +0x22e\ngithub.com/tendermint/tendermint/abci/client.(*localClient).CommitSync(0xc001039680)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/abci/client/local_client.go:258 +0xb6\ngithub.com/tendermint/tendermint/proxy.(*appConnConsensus).CommitSync(0x0)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/proxy/app_conn.go:93 +0x22\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).Commit(_, {{{0xb, 0x0}, {0xc009c57c59, 0x7}}, {0xc009c5dd58, 0x12}, 0x1, 0x17ed6e, {{0xc029b64be0, ...}, ...}, ...}, ...)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/state/execution.go:228 +0x269\ngithub.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(_, {{{0xb, 0x0}, {0xc009c57c59, 0x7}}, {0xc009c5dd58, 0x12}, 0x1, 0x17ed6e, {{0xc029b64be0, ...}, ...}, ...}, ...)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/state/execution.go:180 +0x6ee\ngithub.com/tendermint/tendermint/consensus.(*State).finalizeCommit(0xc0000e6a80, 0x17ed6e)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:1635 +0x9fd\ngithub.com/tendermint/tendermint/consensus.(*State).tryFinalizeCommit(0xc0000e6a80, 0x17ed6e)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:1546 +0x305\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit.func1()\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:1481 +0x87\ngithub.com/tendermint/tendermint/consensus.(*State).enterCommit(0xc0000e6a80, 0x17ed6e, 0x0)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:1519 +0xc06\ngithub.com/tendermint/tendermint/consensus.(*State).addVote(0xc0000e6a80, 0xc01dac6500, {0xc00a7282a0, 0x28})\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:2132 +0xb6e\ngithub.com/tendermint/tendermint/consensus.(*State).tryAddVote(0xc0000e6a80, 0xc01dac6500, {0xc00a7282a0, 0xc00ddc8ea0})\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:1930 +0x2c\ngithub.com/tendermint/tendermint/consensus.(*State).handleMsg(0xc0000e6a80, {{0x1d0e220, 0xc010a8ba08}, {0xc00a7282a0, 0x0}})\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:838 +0x3dc\ngithub.com/tendermint/tendermint/consensus.(*State).receiveRoutine(0xc0000e6a80, 0x0)\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:762 +0x419\ncreated by 
github.com/tendermint/tendermint/consensus.(*State).OnStart\n\t/home/cyber/go/pkg/mod/github.com/tendermint/tendermint@v0.34.11/consensus/state.go:378 +0x12f\n"

The node then runs normally after we restart it.

Probably your pruning setting and statesync setting don't match, so pruning is trying to delete an active statesync snapshot.
Can you post your pruning and statesync snapshot config?
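
For reference, a consistent pairing in app.toml would look roughly like this (key names as in Cosmos SDK v0.42; the values are only illustrative):

# Pruning settings (top level of app.toml)
pruning = "custom"
pruning-keep-recent = "100"
pruning-keep-every = "500"   # snapshot-interval must be a multiple of this
pruning-interval = "10"

[state-sync]
# With snapshot heights landing on kept-forever heights, pruning should never
# delete a version that a snapshot exporter is still reading.
snapshot-interval = 500
snapshot-keep-recent = 2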

Thank you so much, I'll take a look at it. I've also seen this error while running the node; it too is solved by restarting the node:

Dec 09 05:36:52 cyber liked[905686]: 5:36AM INF Dialing peer address={"id":"e44a2165ac573f84151671b092aa4936ac305e2a","ip":"123.202.22.25","port":26656} module=p2p
Dec 09 05:36:53 cyber liked[905686]: panic: runtime error: invalid memory address or nil pointer dereference
Dec 09 05:36:53 cyber liked[905686]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xcec533]
Dec 09 05:36:53 cyber liked[905686]: goroutine 4370601 [running]:
Dec 09 05:36:53 cyber liked[905686]: github.com/cosmos/iavl.(*nodeDB).incrVersionReaders(0x1d48be8, 0x20)
Dec 09 05:36:53 cyber liked[905686]:         /home/cyber/go/pkg/mod/github.com/cosmos/iavl@v0.16.0/nodedb.go:619 +0x33
Dec 09 05:36:53 cyber liked[905686]: github.com/cosmos/iavl.newExporter(0xc02c6cd2a8)
Dec 09 05:36:53 cyber liked[905686]:         /home/cyber/go/pkg/mod/github.com/cosmos/iavl@v0.16.0/export.go:45 +0xc8
Dec 09 05:36:53 cyber liked[905686]: github.com/cosmos/iavl.(*ImmutableTree).Export(...)
Dec 09 05:36:53 cyber liked[905686]:         /home/cyber/go/pkg/mod/github.com/cosmos/iavl@v0.16.0/immutable_tree.go:142
Dec 09 05:36:53 cyber liked[905686]: github.com/cosmos/cosmos-sdk/store/iavl.(*Store).Export(0xc02e06ad80, 0x1d1f2d0)
Dec 09 05:36:53 cyber liked[905686]:         /home/cyber/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.9/store/iavl/store.go:239 +0x52
Dec 09 05:36:53 cyber liked[905686]: github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Snapshot.func2()
Dec 09 05:36:53 cyber liked[905686]:         /home/cyber/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.9/store/rootmulti/store.go:660 +0x4b8
Dec 09 05:36:53 cyber liked[905686]: created by github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Snapshot
Dec 09 05:36:53 cyber liked[905686]:         /home/cyber/go/pkg/mod/github.com/cosmos/cosmos-sdk@v0.42.9/store/rootmulti/store.go:627 +0x53b

Seeing that the error stack originates from Snapshot, I think it's probably something to do with pruning and statesync, maybe the other way round: statesync is snapshotting a pruned datastore, causing the nil pointer error.

Well, we're using a systemd unit. These are the environment variables for the pruning settings:

Environment=LIKED_PRUNING=custom
Environment=LIKED_PRUNING_KEEP_RECENT=100
Environment=LIKED_PRUNING_KEEP_EVERY=0
Environment=LIKED_PRUNING_INTERVAL=10
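
If I understand the Cosmos SDK env var binding correctly, these should be equivalent to the following app.toml pruning keys (listed here just for reference):

pruning = "custom"
pruning-keep-recent = "100"
pruning-keep-every = "0"
pruning-interval = "10"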

We use the same pruning settings on other chains and none of them have such errors.

Can you attach your statesync snapshot setting as well? Thanks.

This one's in the config.toml file:

[statesync]
# State sync rapidly bootstraps a new node by discovering, fetching, and restoring a state machine
# snapshot from peers instead of fetching and replaying historical blocks. Requires some peers in
# the network to take and serve state machine snapshots. State sync is not attempted if the node
# has any local state (LastBlockHeight > 0). The node will have a truncated block history,
# starting from the height of the snapshot.
enable = false

# RPC servers (comma-separated) for light client verification of the synced state machine and
# retrieval of state data for node bootstrapping. Also needs a trusted height and corresponding
# header hash obtained from a trusted source, and a period during which validators can be trusted.
#
# For Cosmos SDK-based chains, trust_period should usually be about 2/3 of the unbonding time (~2
# weeks) during which they can be financially punished (slashed) for misbehavior.
rpc_servers = ""
trust_height = 0
trust_hash = ""
trust_period = "168h0m0s"

# Time to spend discovering snapshots before initiating a restore.
discovery_time = "15s"

# Temporary directory for state sync snapshot chunks, defaults to the OS tempdir (typically /tmp).
# Will create a new, randomly named directory within, and remove it when done.
temp_dir = ""

# The timeout duration before re-requesting a chunk, possibly from a different
# peer (default: 1 minute).
chunk_request_timeout = "10s"

# The number of concurrent chunk fetchers to run (default: 1).
chunk_fetchers = "4"

Can you include the one in app.toml, which is the snapshot setting? The one in config.toml is the client setting.

oh sorry

[state-sync]

# snapshot-interval specifies the block interval at which local state sync snapshots are
# taken (0 to disable). Must be a multiple of pruning-keep-every.
snapshot-interval = 0

# snapshot-keep-recent specifies the number of recent snapshots to keep and serve (0 to keep all).
snapshot-keep-recent = 2

The config seems fine, but I think the statesync snapshot must somehow have been active for these errors to occur.
Could some env var be overriding the config?
We'll use the same config and try to reproduce this later on our side.
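
If there is an override, it should look like the pruning ones above, e.g. something like the following in the systemd unit (the value 1000 here is purely hypothetical, but it would line up with the node crashing every 1000 blocks and with the pruned version 1568000 in the error being a multiple of 1000):

# Hypothetical override: with the SDK's env binding, state-sync.snapshot-interval
# would be read from an env var like this, re-enabling snapshots despite app.toml.
Environment=LIKED_STATE_SYNC_SNAPSHOT_INTERVAL=1000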

Please reopen if this happens again.