deso-protocol/run

32 GB RAM, 8 vCPUs, 640 GB SSD - TXINDEX CRASH

AdonousTech opened this issue · 10 comments

Node running for 52 days with TXINDEX = false.

Decided to try to sync TXINDEX to enable notifications, and boom -- nothing but crashes.

tijno commented

Check a couple of things:

  • memory allocated to Docker - some setups do not give containers access to most of the host's memory
  • dmesg output for backend crash causes
  • too many open files issue
  • disable mining
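For the dmesg item in the list above, a minimal sketch of how the kernel log could be scanned for out-of-memory kills. The helper name `find_oom_lines` and the exact patterns are assumptions for illustration; the kernel's wording varies between versions, so treat a miss as inconclusive rather than as proof that memory was fine.

```python
import re
import subprocess
from typing import List

# Substrings the Linux kernel commonly logs when the OOM killer fires.
# This pattern list is an assumption and may not cover every kernel version.
OOM_PATTERN = re.compile(r"out of memory|oom-kill|oom_reaper", re.IGNORECASE)


def find_oom_lines(dmesg_text: str) -> List[str]:
    """Return the lines of dmesg output that mention the OOM killer."""
    return [line for line in dmesg_text.splitlines() if OOM_PATTERN.search(line)]


if __name__ == "__main__":
    # Reading the kernel log may require root on some hosts.
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    for line in find_oom_lines(result.stdout):
        print(line)
```

If this prints lines naming the backend process, the crash is more likely a memory limit than a CPU problem.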

Even with a fully synced node and a full TXINDEX, my nodes crash every so often with weird CPU spikes.

@tijno - I'll check that out. Thanks for the help!

I can confirm TXINDEX is now fully synced. As @tijno indicated, I've also experienced 2 random crashes since then. These crashes happen as a result of CPU spikes. Any ideas what's causing this?

Getting the node back up and running manually each time this happens is not scalable. For example, the last crash was at 2AM PST. I was asleep, so manual intervention is not feasible.

[Image: node_crash]

Do you have any logs from around the time the spike occurred?

Are there any errors around the time of the crash?

@maebeam - Let me check that out and get back to you.

@maebeam - This appears to be the error logged around the time of the latest crash. I'm in the PST timezone (UTC-8).
Note - I checked the block explorer for that particular TxHash, and it does not exist.

2021-12-01T10:37:25.338464061Z E1201 10:37:25.338303 1 server.go:1311] Server._handleTransactionBundle: Rejected transaction < TxHash: 03a2417e04f2952d914ee9b489b41016092e090210d3367fa4542bb8a1842d90, TxnType: SUBMIT_POST, PubKey: BC1YLimyxAJxmdg9WPRndYueXaXB5P1cqNqi4dz7Rh7WkhdJR9e8Df7 > from peer [ Remote Address: 34.123.41.111:17000 PeerID=8 ] from mempool: tryAcceptTransaction: Problem connecting transaction after connecting dependencies: : ConnectTransaction: : _connectSubmitPost: error with _getParentAndGrandparentPostEntry: 03a2417e04f2952d914ee9b489b41016092e090210d3367fa4542bb8a1842d90: _getParentAndGrandparentPostEntry: failed to find parent post for post hash: 03a2417e04f2952d914ee9b489b41016092e090210d3367fa4542bb8a1842d90, parentStakeId: 59733ff9e9c3ed5a98dbd7324b17bd855feeb0becb4f5c6aa7aa123e1e33bf62: RuleErrorSubmitPostParentNotFound

Thanks. If you could upload the full log file that would be helpful

My apologies, full logs attached. I noticed a ton of these messages around the time of the error:
{"log":" from peer [ Remote Address: 34.123.41.111:17000 PeerID=8 ] was added as an ORPHAN\n","stream":"stderr","time":"2021-12-01T10:37:25.305064706Z"}

Logs (GDrive)
https://drive.google.com/drive/folders/1GTcGNHnXc_JQX1pUem5P7HaLxIgCpn1K?usp=sharing
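To put a number on "a ton of these messages", here is a rough sketch that counts ORPHAN lines per minute. It assumes the log file holds one JSON object per line with `log`, `stream`, and `time` keys, as in the sample line quoted above. The function name `count_orphans_per_minute` is made up for this example.

```python
import json
from collections import Counter


def count_orphans_per_minute(lines):
    """Count 'added as an ORPHAN' messages per minute in JSON log lines."""
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        if "added as an ORPHAN" in str(entry.get("log", "")):
            # Timestamps look like 2021-12-01T10:37:25.305064706Z;
            # the first 16 characters identify the minute.
            counts[str(entry.get("time", ""))[:16]] += 1
    return counts
```

Usage would be something like `count_orphans_per_minute(open(path))`, where `path` is wherever the container log lives on your host. A sharp jump in the minute before a crash would suggest the orphan flood and the CPU spike are related, though it would not prove which caused which.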

Closing this as I believe the problem is related to a limitation with AWS. Running instances (Lightsail) accrue "burstable" vCPU performance time for each hour the instance runs. Instances in the "burstable" performance zone can only run at that vCPU utilization rate for so long. The length of run time is determined by the burst capacity reserve.

When the node is syncing, TXINDEX uses a lot of vCPU. Although I'm running 8 vCPUs, the utilization limit is 17%, averaged across all vCPUs.
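As a back-of-the-envelope illustration of why a long sync exhausts the reserve, here is a simplified model. It assumes the reserve is expressed as minutes available at 100% utilization and drains in proportion to how far utilization sits above the 17% baseline. That proportional model and the name `minutes_of_burst` are my assumptions, not AWS's documented accounting.

```python
import math

BASELINE_PCT = 17.0  # utilization limit quoted in this thread


def minutes_of_burst(reserve_minutes_at_100: float,
                     utilization_pct: float,
                     baseline_pct: float = BASELINE_PCT) -> float:
    """Estimate how long a burst reserve lasts at a given utilization.

    Simplified model (an assumption): the reserve drains at a rate
    proportional to (utilization - baseline), and does not drain at all
    when utilization is at or below the baseline.
    """
    if not 0.0 <= baseline_pct < 100.0:
        raise ValueError("baseline_pct must be in [0, 100)")
    if utilization_pct <= baseline_pct:
        return math.inf  # at or below baseline the reserve is not consumed
    return reserve_minutes_at_100 * (100.0 - baseline_pct) / (utilization_pct - baseline_pct)
```

Under this model, a reserve worth 60 minutes at full load lasts 60 minutes at 100% utilization and about twice that at 58.5%, so a sync that runs for many hours well above the baseline will eventually empty it.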

So, the "crashes" are by design. It seems like a way to force you to plan for vCPU intensive work.

I've added additional alarms. I'll stop the services when my reserve capacity runs low. I'll continue the TXINDEX sync after the reserve is back to sustainable levels.
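The stop-and-resume approach could be sketched as a small decision function with two thresholds, so the services do not flap on and off around a single cutoff. The function name `next_action` and the 20% / 60% thresholds are placeholders of my own choosing; the burst capacity reading itself would come from whatever monitoring you already have in place.

```python
def next_action(burst_capacity_pct: float,
                paused: bool,
                pause_below: float = 20.0,
                resume_above: float = 60.0) -> str:
    """Decide whether to pause or resume the sync based on remaining burst capacity.

    Returns "pause", "resume", or "hold". The two thresholds are
    hypothetical defaults; the gap between them keeps the services
    from toggling repeatedly near one cutoff.
    """
    if pause_below >= resume_above:
        raise ValueError("pause_below must be lower than resume_above")
    if not paused and burst_capacity_pct < pause_below:
        return "pause"
    if paused and burst_capacity_pct > resume_above:
        return "resume"
    return "hold"
```

A scheduled job could call this every few minutes and stop or start the containers on "pause" or "resume", which would cover the 2AM case without manual intervention.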