moonbeam-foundation/moonbeam

Failed to receive a message from Overseer: Signal channel is terminated and empty.


Moonbeam-skylake 0.33, operating as a full-node

I am running a full node and querying it heavily over localhost. After approximately 70 blocks, the moonbeam node crashes with:

Oct 28 05:27:55 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:55 [Relaychain] ✨ Imported #17915340 (0x8d94…96f5)
Oct 28 05:27:55 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:55 [Relaychain] 💤 Idle (6 peers), best: #17915340 (0x8d94…96f5), finalized #17915337 (0x2545…00b5), ⬇ 30.2kiB/s ⬆ 7.0kiB/s
Oct 28 05:27:55 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:55 [🌗] ⚙️  Preparing  0.0 bps, target=#4743523 (11 peers), best: #4743508 (0x95c1…3d43), finalized #4743505 (0x7bff…d019), ⬇ 4.9kiB/s ⬆ 89 B/s
Oct 28 05:27:56 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:56 [Relaychain] cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0x5adec8fe76ac16a0e2ff5bb1333dab8d683b67ab6fbda537c577511b3d8c511b
Oct 28 05:27:56 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:56 [Relaychain] Failed to fetch runtime API data for job err=NotSupported { runtime_api_name: "validator_groups" }
Oct 28 05:27:56 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:56 [Relaychain] cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0x5adec8fe76ac16a0e2ff5bb1333dab8d683b67ab6fbda537c577511b3d8c511b
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] error receiving message from subsystem context: Generated(Context("Signal channel is terminated and empty.")) err=Generated(Context("Signal channel is terminated and empty."))
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] subsystem exited with error subsystem="network-bridge-rx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] subsystem exited with error subsystem="dispute-distribution-subsystem" err=FromOrigin { origin: "dispute-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] subsystem exited with error subsystem="provisioner-subsystem" err=FromOrigin { origin: "provisioner", source: OverseerExited(Generated(Context("Signal channel is terminated and empty."))) }
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] subsystem exited with error subsystem="network-bridge-tx-subsystem" err=FromOrigin { origin: "network-bridge", source: SubsystemError(Generated(Context("Signal channel is terminated and empty."))) }
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("approval-distribution-subsystem"))
Oct 28 05:27:58 stakebaby-chalandri moonbeam[2508374]: 2023-10-28 05:27:58 [Relaychain] Essential task `overseer` failed. Shutting down service.

This appears to be related to heavy load or concurrency, because the node does not crash if I reduce the query rate. To put this in context, the node is queried by 7-12 Node.js processes, each of which can execute up to 300 queries concurrently. The moonbeam process averages 200%-450% of logical core capacity. I have tried different block spans and the error persists, so I don't think it's related to database corruption.

It looks like a Polkadot issue, but I am not sure whether it has been resolved or ignored:
paritytech/polkadot#6624

Thank you @ioannist. We are aware of this issue but haven't been able to reproduce it reliably, so maybe with your help we can pinpoint where it comes from.

It eventually happens (within 50 blocks give or take) under heavy load. It does not happen if I only run one indexing worker (or one block at a time).
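For anyone who needs to keep indexing while this is open, here is a minimal sketch of capping the number of in-flight queries per worker. It is not part of the indexer described above: the `Limiter` class, the limit of 32, and the `queryBlock` function mentioned in the comments are all hypothetical, and a cap only reduces load on the node; it does not fix the underlying Overseer stall.

```typescript
// Minimal concurrency limiter: at most `limit` tasks are in flight at once.
// Everything here is illustrative; names and numbers are assumptions.
class Limiter {
  private readonly limit: number;
  private readonly queue: Array<() => void> = [];
  active = 0; // number of tasks currently running

  constructor(limit: number) {
    if (!Number.isInteger(limit) || limit < 1) {
      throw new RangeError("limit must be a positive integer");
    }
    this.limit = limit;
  }

  // Number of tasks waiting for a free slot.
  get pending(): number {
    return this.queue.length;
  }

  // Runs `task` immediately if a slot is free, otherwise queues it (FIFO).
  run<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      // Frees the slot and starts the next queued task, if any.
      const done = () => {
        this.active--;
        const next = this.queue.shift();
        if (next) next();
      };
      const start = () => {
        this.active++;
        // Wrapping in a Promise also turns a synchronous throw into a rejection.
        new Promise<T>((res) => res(task())).then(
          (value) => { done(); resolve(value); },
          (error) => { done(); reject(error); },
        );
      };
      if (this.active < this.limit) start();
      else this.queue.push(start);
    });
  }
}

// Hypothetical usage, where queryBlock stands in for whatever RPC call the
// indexer makes against the local node:
//   const limiter = new Limiter(32);
//   const results = await Promise.all(
//     blockNumbers.map((n) => limiter.run(() => queryBlock(n))),
//   );
```

Lowering the limit trades indexing throughput for less pressure on the node, so the right value has to be found by experiment.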

I've been playing around with flags, and setting this one seems to avert the issue:
--max-runtime-instances 256

Looks like they are getting to the bottom of it here:
paritytech/polkadot-sdk#840

Our full node keeps crashing every 20 minutes or so with this error.