LiskArchive/lisk-sdk

Lisk-core node crashed with Invalid engine state error

Issue closed · 5 comments

Expected behavior

Lisk core node should not crash. Error should be handled gracefully.

Actual behavior

Lisk core crashes with error:

/root/.pm2/logs/lisk-core-error.log:7393:2023-10-23T22:46:50.354Z ERROR sameer-ubuntu-base-s-2vcpu-4gb-120gb-intel-fra1-01 engine 3627583 [err=Event response for 'rpc-request' timed out] EVENT_FAILED_TO_FETCH_PEER_INFO: Failed to fetch peer info
/root/.pm2/logs/lisk-core-error.log:7598:2023-10-23T22:46:50.355Z ERROR sameer-ubuntu-base-s-2vcpu-4gb-120gb-intel-fra1-01 engine 3627583 [err=Event response for 'rpc-request' timed out] EVENT_FAILED_TO_FETCH_PEER_INFO: Failed to fetch peer info
/root/.pm2/logs/lisk-core-error.log-7803-    Error: Invalid engine state. Conflict at engine height 20662215 and 
/root/.pm2/logs/lisk-core-error.log-7876-    application state 20662214.

The node was fully synced with the network when this happened (service indexing was in progress, which only starts after the node is synced).
After this error the node crashed, and every restart fails with the same invalid engine state error.
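To compare the two heights across several nodes, the conflict message can be pulled out of the pm2 error log. The following is a minimal sketch, not part of lisk-core or the SDK: the helper names (`stripLogPrefix`, `parseConflict`) are hypothetical, and the prefix pattern assumes log lines shaped like the grep output quoted above.

```typescript
// Hypothetical helpers for reading the error log quoted above.
// Nothing here is an SDK API; the patterns are derived from the log excerpt only.

interface HeightConflict {
  engineHeight: number;
  applicationHeight: number;
}

// Drops a grep-style prefix such as "/path/file.log-7803-" or "/path/file.log:7393:".
function stripLogPrefix(line: string): string {
  return line.replace(/^\S*\.log[-:]\d+[-:]/, "");
}

// The message spans two log lines, so whitespace (including a newline) is allowed after "and".
const CONFLICT_PATTERN =
  /Invalid engine state\. Conflict at engine height (\d+) and\s+application state (\d+)\./;

// Returns the two heights, or undefined if the text contains no conflict message.
function parseConflict(text: string): HeightConflict | undefined {
  const match = CONFLICT_PATTERN.exec(text);
  if (match === null) {
    return undefined;
  }
  return {
    engineHeight: Number(match[1]),
    applicationHeight: Number(match[2]),
  };
}

// Usage with the two lines from the report.
const reportedLines = [
  "/root/.pm2/logs/lisk-core-error.log-7803-    Error: Invalid engine state. Conflict at engine height 20662215 and ",
  "/root/.pm2/logs/lisk-core-error.log-7876-    application state 20662214.",
];
const reported = parseConflict(reportedLines.map(stripLogPrefix).join("\n"));
```

For the log in this report, the engine height is exactly one block ahead of the application state.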

Steps to reproduce

Start a lisk-core node, version 4.0.0-rc.6, from scratch with the following modified testnet config:

{
	"system": {
		"dataPath": "~/.lisk",
		"logLevel": "debug",
		"keepEventsForHeights": -1
	},
	"rpc": {
		"modes": ["ipc", "ws", "http"],
		"port": 7887,
		"host": "0.0.0.0",
		"allowedMethods": ["*"]
	},
	"genesis": {
		"block": {
			"fromFile": "./config/genesis_block.blob"
		},
		"blockTime": 10,
		"chainID": "01000000",
		"maxTransactionsSize": 15360,
		"minimumCertifyHeight": 20520176
	},
	"network": {
		"version": "4.0",
		"seedPeers": [
			{
				"ip": "testnet-seed-01.lisk.com",
				"port": 7667
			},
			{
				"ip": "testnet-seed-02.lisk-nodes.net",
				"port": 7667
			},
			{
				"ip": "testnet-seed-03.lisk.com",
				"port": 7667
			},
			{
				"ip": "testnet-seed-04.lisk-nodes.net",
				"port": 7667
			},
			{
				"ip": "testnet-seed-05.lisk.com",
				"port": 7667
			}
		],
		"port": 7667
	},
	"transactionPool": {
		"maxTransactions": 4096,
		"maxTransactionsPerAccount": 64,
		"transactionExpiryTime": 10800000,
		"minEntranceFeePriority": "0",
		"minReplacementFeeDifference": "10"
	},
	"modules": {
		"dynamicReward": {
			"tokenID": "0100000000000000",
			"offset": 2160,
			"distance": 3000000,
			"brackets": ["500000000", "400000000", "300000000", "200000000", "100000000"]
		},
		"fee": {
			"feeTokenID": "0100000000000000"
		},
		"pos": {
			"maxBFTWeightCap": 1000,
			"useInvalidBLSKey": true
		}
	},
	"plugins": {}
}

Which version(s) does this affect? (Environment, OS, etc...)

6.0.0-rc.3

I noticed the same issue with 4.0.0-rc.7-dryrun-patch. I ran 4 instances of lisk-core, and all of them eventually hit this error. The error height differed between the nodes.

config:

{
	"system": {
		"dataPath": "~/.lisk",
		"logLevel": "info",
		"keepEventsForHeights": -1
	},
	"rpc": {
		"modes": ["ipc", "ws"],
		"port": 7887,
		"host": "0.0.0.0",
		"allowedMethods": ["*"]
	},
	"genesis": {
		"block": {
			"fromFile": "./config/genesis_block.blob"
		},
		"blockTime": 10,
		"chainID": "99000000",
		"maxTransactionsSize": 15360,
		"minimumCertifyHeight": 23134934
	},
	"network": {
		"version": "5.0",
		"seedPeers": [
			{
				"ip": "mainnet-seed-01.liskdev.net",
				"port": 7667
			},
			{
				"ip": "mainnet-seed-02.liskdev.net",
				"port": 7667
			},
			{
				"ip": "mainnet-seed-03.liskdev.net",
				"port": 7667
			}
		],
		"port": 7667
	},
	"transactionPool": {
		"maxTransactions": 4096,
		"maxTransactionsPerAccount": 64,
		"transactionExpiryTime": 10800000,
		"minEntranceFeePriority": "0",
		"minReplacementFeeDifference": "10"
	},
	"modules": {
		"dynamicReward": {
			"tokenID": "9900000000000000",
			"offset": 2160,
			"distance": 3000000,
			"brackets": ["500000000", "400000000", "300000000", "200000000", "100000000"]
		},
		"fee": {
			"feeTokenID": "9900000000000000"
		},
		"pos": {
			"maxBFTWeightCap": 1000,
			"useInvalidBLSKey": true
		}
	},
	"plugins": {}
}

Can you (@vardan10 @priojeetpriyom) please also add more details from the process log messages for one of the nodes? pm2 tries to restart the process, so the message originally posted in the issue may be the error raised when the execution is retried; the original error may be something else.

Attaching the entire error log for that run.
lisk-core-error__2023-10-25_00-00-00.log

@has5aan My observation so far is that this almost certainly happens on instances where I download and import the blockchain snapshots. I have a couple of servers where the issue currently exists, and I had imported the snapshot on them. I can grant you access to them on Monday.

Logs from the failing environment show that the node crashes due to a memory shortage; the application shutdown process is skipped, which results in an invalid block state. No change is necessary.
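The mechanism described here can be illustrated with a toy model. This is a sketch under stated assumptions, not the SDK's code: the `TwoStoreNode` class is hypothetical, and it assumes, purely for illustration, that engine and application state are persisted one after the other, so a process killed between the two writes leaves them one block apart.

```typescript
// Hypothetical model of a node that persists two heights independently.
// The write order (engine first, application second) is an assumption made
// only to mirror the heights seen in the reported error.
class TwoStoreNode {
  engineHeight = 0;
  applicationHeight = 0;

  // killedMidCommit simulates the process being killed (for example by the
  // OS under memory pressure) after the first write, with no clean shutdown.
  commitBlock(killedMidCommit = false): void {
    this.engineHeight += 1;
    if (killedMidCommit) {
      return; // the second write never happens
    }
    this.applicationHeight += 1;
  }

  // Startup-style check: refuses to continue if the two heights disagree.
  assertConsistent(): void {
    if (this.engineHeight !== this.applicationHeight) {
      throw new Error(
        `Invalid engine state. Conflict at engine height ${this.engineHeight} ` +
          `and application state ${this.applicationHeight}.`,
      );
    }
  }
}

const node = new TwoStoreNode();
node.commitBlock();
node.commitBlock();
node.assertConsistent(); // heights agree, no error

node.commitBlock(true); // killed between the two writes
// Every later call to node.assertConsistent() now throws, which matches the
// behaviour of the node failing with the same error on each restart.
```

In this model the mismatch persists across restarts because nothing repairs the stores, which is consistent with the restart loop described earlier in the thread.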