deso-protocol/run

Node suddenly shows "502 Bad Gateway" error message

ConfidenceYobo opened this issue · 18 comments

Everything works normally, but sometimes the node suddenly returns a "502 Bad Gateway" error and everything stops working until I restart it. Sometimes I also need to resync the node before everything works normally again.

tijno commented

Check the logs for issues like "too many open files", which is the most common cause of the backend crashing and returning the 502 Bad Gateway error.

Also check whether you might be running out of memory.
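
Not from the thread, just a minimal Linux-only Go sketch of that check: it counts a process's open file descriptors and reads its soft "Max open files" limit from /proc. The file name and the PID argument are illustrative; run it as root or as the same user as the backend.

```go
// fdcount.go - quick standalone diagnostic, not part of the DeSo backend.
// Usage: go run fdcount.go <pid-of-backend>
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcount <pid>")
		os.Exit(1)
	}
	pid := os.Args[1]

	// Every entry under /proc/<pid>/fd is one open descriptor.
	fds, err := os.ReadDir("/proc/" + pid + "/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read fds:", err)
		os.Exit(1)
	}

	// The soft limit is the first numeric column of the
	// "Max open files" row in /proc/<pid>/limits.
	limitsFile, err := os.Open("/proc/" + pid + "/limits")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read limits:", err)
		os.Exit(1)
	}
	defer limitsFile.Close()

	softLimit := "unknown"
	scanner := bufio.NewScanner(limitsFile)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "Max open files") {
			if fields := strings.Fields(scanner.Text()); len(fields) >= 4 {
				softLimit = fields[3]
			}
		}
	}

	fmt.Printf("pid %s: %d open fds, soft limit %s\n", pid, len(fds), softLimit)
}
```

If the count sits near the limit, raising it (ulimit -n, systemd's LimitNOFILE, or Docker's --ulimit) is the usual fix for a "too many open files" crash.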

Thanks for your response. I have checked, and I am not running low on memory - I have used only about 2% of it - and I also have enough space on disk.

I have checked the logs and can't find any "too many open files" error, but I did find this in the log:

Server._handleTransactionBundle: Rejected transaction < TxHash: 4cb5bb4e968c37c98376ceb1c14aac74be1303bc309eddfc343f92ad3a5f42b7, TxnType: LIKE, PubKey: BC1YLiSpY6Ec9NWTNfmziLhSrrdB8dbVx4nspWAgkZgKic3Wxteiynx > from peer [ Remote Address: 34.123.41.111:17000 PeerID=2 ] from mempool: TxErrorDuplicate

tijno commented

Those do happen often as a result of a crash - it may stop TXIndex from keeping up with new blocks. But I've not seen it cause crashes itself.

What are some possible causes of crashes?

tijno commented

What I mentioned above:

out of memory
out of file handles

also:

out of disk space
server crash

(a quick check for the disk-space and memory cases is sketched below)
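
Not from the thread either - a rough Linux-only Go sketch of those disk-space and memory checks. The /db path is a placeholder; point it at wherever your node actually keeps its data.

```go
// rescheck.go - standalone sketch, not part of the DeSo backend.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"syscall"
)

func main() {
	dataDir := "/db" // placeholder - your node's data directory

	// Free space on the filesystem holding the data directory.
	var fs syscall.Statfs_t
	if err := syscall.Statfs(dataDir, &fs); err != nil {
		fmt.Fprintln(os.Stderr, "statfs:", err)
		os.Exit(1)
	}
	freeGiB := float64(fs.Bavail) * float64(fs.Bsize) / (1 << 30)
	fmt.Printf("free disk under %s: %.1f GiB\n", dataDir, freeGiB)

	// Available memory as the kernel reports it.
	meminfo, err := os.Open("/proc/meminfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, "meminfo:", err)
		os.Exit(1)
	}
	defer meminfo.Close()

	scanner := bufio.NewScanner(meminfo)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "MemAvailable:") {
			fmt.Println(scanner.Text())
		}
	}
}
```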

But none of these is the case for me.

tijno commented

I get this sometimes on the admin section of a node, and I have to log out of my BitClout account on the node and log back in for it to go away.

Are you seeing the same?

It happens mostly when I'm not logged in to the BitClout node but am using the API.

tijn commented

@tijno sorry for spamming the conversation again... but I keep getting notified now because of the tagline behind your name: "(BitClout @tijn)" 🤣

tijno commented

oh man github :) sorry @tijn, I'll change it

tijn commented

@tijno Thank you!

tijno commented

all done

Fixed the issue by increasing the server's memory to 64 GB.

Hey -- wanted to drop a comment here, as this has been happening on 8 nodes under my company's management. All of the machines have 30 GB of memory, and we work around the OOMs by simply using Docker's restart flag (I know, not a great option, but it works temporarily). After speaking with @tijno, he runs nodes on a 32 GB machine and maxes out at around 60% memory usage. I'll also note that all eight of these nodes have been synced for an extended period of time, and these crashes occur quite randomly. The following OOM occurs:

[2025186.138224] Out of memory: Killed process 215489 (backend) total-vm:266280732kB, anon-rss:30178620kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:108952kB oom_score_adj:0
[2025187.222710] oom_reaper: reaped process 215489 (backend), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The OOMs have all been caused by rejected duplicate TXs:

E0831 14:01:03.874597 1 server.go:1311] Server._handleTransactionBundle: Rejected transaction < TxHash: 25452952cf8b3a8adc6f3412a2bcc4b9aa4e7960ec4d3052b8f4f8e1ff42d93c, TxnType: PRIVATE_MESSAGE, PubKey: BC1YLhhrJUg1ms7P3YMQcjGPTVY9Tf8poJ1Xdeqt6AsoJ5g3zNvFz98 > from peer [ Remote Address: 34.123.41.111:17000 PeerID=5 ] from mempool: TxErrorDuplicate

While increasing memory is definitely a solution, and restarting on a crash is also... something, haha, I see no reason why a node can't run on a 30 GB machine. My worry is that there's a potential memory leak, even though that's fairly uncommon in Go... Beyond this, I have little idea why an already-synced node would require more than 30 GB -- especially since this is occurring uniformly across all eight nodes under our management, and always after a duplicate TX error is produced.

It is, of course, also possible that I'm just missing something. I would really appreciate any suggestions, as simply restarting the process after a crash isn't the best approach, let alone effective long-term, hahaha.
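
One mitigation worth trying on the 30 GB boxes, assuming the backend is built with Go 1.19 or newer (the thread doesn't confirm this, and the 26 GiB figure is just an illustrative value): give the runtime a soft memory limit below the machine's RAM so the GC works harder before the kernel OOM-killer steps in.

```go
// memlimit.go - generic illustration of Go's soft memory limit,
// the same knob the GOMEMLIMIT environment variable controls.
// This is not a patch to the DeSo backend.
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Ask the GC to keep heap plus runtime overhead under ~26 GiB,
	// leaving headroom on a 30 GB machine for off-heap usage and the OS.
	debug.SetMemoryLimit(26 << 30)

	// Equivalent without any code change:
	//   GOMEMLIMIT=26GiB ./backend ...
	fmt.Println("soft memory limit:", debug.SetMemoryLimit(-1), "bytes")
}
```

It only helps if the growth is Go heap the GC can actually reclaim; if the usage is off-heap (e.g. memory-mapped files), the limit won't prevent an OOM kill.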

maebeam commented

We profile our nodes 24/7 and aren't aware of any memory leaks. Badger is a memory hog and is on its way out.
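
For anyone who wants to rule out a leak on their own node, the standard Go route is a heap profile via net/http/pprof. This is a generic sketch, not how the backend necessarily wires it up; the localhost:6060 port is an assumption.

```go
// pprofserver.go - generic example of exposing Go's built-in profiler.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// Serve the profiler on a local-only port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application runs here ...
	select {}
}
```

Heap snapshots taken a few hours apart with go tool pprof http://localhost:6060/debug/pprof/heap show whether in-use Go heap is actually growing; a flat heap with rising RSS points at off-heap usage (e.g. Badger's memory-mapped files) rather than a leak in the Go code.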

Makes sense -- thanks for the reply @maebeam

Glad to see Badger go, for a number of reasons hahaha