deso-protocol/run

Node suddenly shows "502 Bad Gateway" error message

ConfidenceYobo opened this issue · 18 comments

Everything works normally, but sometimes the node suddenly returns a "502 Bad Gateway" error and everything stops working until I restart it. Sometimes I also need to resync the node before everything works normally again.

tijno commented

Check the logs for issues like "too many open files", which is the most common cause of the backend crashing and returning the 502 Bad Gateway error.

Also check whether you might be running out of memory.
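
Not from the thread, just a minimal Linux-only Go sketch of that check: it counts a process's open file descriptors and reads its soft "Max open files" limit from /proc. The file name and the PID argument are illustrative; run it as root or as the same user as the backend.

```go
// fdcount.go - quick standalone diagnostic, not part of the DeSo backend.
// Usage: go run fdcount.go <pid-of-backend>
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcount <pid>")
		os.Exit(1)
	}
	pid := os.Args[1]

	// Every entry under /proc/<pid>/fd is one open descriptor.
	fds, err := os.ReadDir("/proc/" + pid + "/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read fds:", err)
		os.Exit(1)
	}

	// The soft limit is the first numeric column of the
	// "Max open files" row in /proc/<pid>/limits.
	limitsFile, err := os.Open("/proc/" + pid + "/limits")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read limits:", err)
		os.Exit(1)
	}
	defer limitsFile.Close()

	softLimit := "unknown"
	scanner := bufio.NewScanner(limitsFile)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "Max open files") {
			if fields := strings.Fields(scanner.Text()); len(fields) >= 4 {
				softLimit = fields[3]
			}
		}
	}

	fmt.Printf("pid %s: %d open fds, soft limit %s\n", pid, len(fds), softLimit)
}
```

If the count sits near the limit, raising it (ulimit -n, systemd's LimitNOFILE, or Docker's --ulimit) is the usual fix for a "too many open files" crash.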

Thanks for your response. I have checked, and I am not running low on memory - I have used only about 2% of it - and I also have enough space on disk.

I have checked the logs and can't find any "too many open files" error, but I did find this in the log:

Server._handleTransactionBundle: Rejected transaction < TxHash: 4cb5bb4e968c37c98376ceb1c14aac74be1303bc309eddfc343f92ad3a5f42b7, TxnType: LIKE, PubKey: BC1YLiSpY6Ec9NWTNfmziLhSrrdB8dbVx4nspWAgkZgKic3Wxteiynx > from peer [ Remote Address: 34.123.41.111:17000 PeerID=2 ] from mempool: TxErrorDuplicate

tijno commented

Those do happen often as a result of a crash - it may stop TXIndex from keeping up with new blocks. But I've not seen it cause crashes itself.

What are some possible causes of crashes?

tijno commented

What I mentioned above:

out of memory
out of file handles

also:

out of disk space
server crash

(a quick check for the disk-space and memory cases is sketched below)
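
Not from the thread either - a rough Linux-only Go sketch of those disk-space and memory checks. The /db path is a placeholder; point it at wherever your node actually keeps its data.

```go
// rescheck.go - standalone sketch, not part of the DeSo backend.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"syscall"
)

func main() {
	dataDir := "/db" // placeholder - your node's data directory

	// Free space on the filesystem holding the data directory.
	var fs syscall.Statfs_t
	if err := syscall.Statfs(dataDir, &fs); err != nil {
		fmt.Fprintln(os.Stderr, "statfs:", err)
		os.Exit(1)
	}
	freeGiB := float64(fs.Bavail) * float64(fs.Bsize) / (1 << 30)
	fmt.Printf("free disk under %s: %.1f GiB\n", dataDir, freeGiB)

	// Available memory as the kernel reports it.
	meminfo, err := os.Open("/proc/meminfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, "meminfo:", err)
		os.Exit(1)
	}
	defer meminfo.Close()

	scanner := bufio.NewScanner(meminfo)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "MemAvailable:") {
			fmt.Println(scanner.Text())
		}
	}
}
```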

But none of these is the case for me.

tijno commented

I get this sometimes on the admin section of a node, and I have to log out of my BitClout account on the node and log back in for it to go away.

Are you seeing the same?

It happens mostly when I'm not logged in to the BitClout node but am using the API.

tijn commented

@tijno sorry for spamming the conversation again... but I keep getting notified now because of the tagline behind your name: "(BitClout @tijn)" 🤣

tijno commented

oh man github :) sorry @tijn, I'll change it

tijn commented

@tijno Thank you!

tijno commented

all done

Fixed the issue by increasing the server's memory to 64 GB.

Hey -- wanted to drop a comment here, as this has been happening on 8 nodes under my company's management. All of the machines have 30 GB of memory, and we work around the OOMs by simply using Docker's restart flag (I know, not a great option, but it works temporarily). After speaking with @tijno, he runs nodes on a 32 GB machine and maxes out at around 60% memory usage. I'll also note that all eight of these nodes have been synced for an extended period of time, and these crashes occur quite randomly. The following OOM occurs:

[2025186.138224] Out of memory: Killed process 215489 (backend) total-vm:266280732kB, anon-rss:30178620kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:108952kB oom_score_adj:0
[2025187.222710] oom_reaper: reaped process 215489 (backend), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The OOMs have all been caused by rejected duplicate TXs:

E0831 14:01:03.874597 1 server.go:1311] Server._handleTransactionBundle: Rejected transaction < TxHash: 25452952cf8b3a8adc6f3412a2bcc4b9aa4e7960ec4d3052b8f4f8e1ff42d93c, TxnType: PRIVATE_MESSAGE, PubKey: BC1YLhhrJUg1ms7P3YMQcjGPTVY9Tf8poJ1Xdeqt6AsoJ5g3zNvFz98 > from peer [ Remote Address: 34.123.41.111:17000 PeerID=5 ] from mempool: TxErrorDuplicate

While increasing memory is definitely a solution, and restarting on a crash is also... something, haha, I see no reason why a node can't run on a 30 GB machine. My worry is that there's a potential memory leak, even though that's fairly uncommon in Go... Beyond this, I have little idea why an already-synced node would require more than 30 GB -- especially since this is occurring uniformly across all eight nodes under our management, and always after a duplicate TX error is produced.

It is, of course, also possible that I'm just missing something. I would really appreciate any suggestions, as simply restarting the process after a crash isn't the best approach, let alone effective long-term, hahaha.
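
One mitigation worth trying on the 30 GB boxes, assuming the backend is built with Go 1.19 or newer (the thread doesn't confirm this, and the 26 GiB figure is just an illustrative value): give the runtime a soft memory limit below the machine's RAM so the GC works harder before the kernel OOM-killer steps in.

```go
// memlimit.go - generic illustration of Go's soft memory limit,
// the same knob the GOMEMLIMIT environment variable controls.
// This is not a patch to the DeSo backend.
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Ask the GC to keep heap plus runtime overhead under ~26 GiB,
	// leaving headroom on a 30 GB machine for off-heap usage and the OS.
	debug.SetMemoryLimit(26 << 30)

	// Equivalent without any code change:
	//   GOMEMLIMIT=26GiB ./backend ...
	fmt.Println("soft memory limit:", debug.SetMemoryLimit(-1), "bytes")
}
```

It only helps if the growth is Go heap the GC can actually reclaim; if the usage is off-heap (e.g. memory-mapped files), the limit won't prevent an OOM kill.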

maebeam commented

We profile our nodes 24/7 and aren't aware of any memory leaks. Badger is a memory hog and is on its way out.
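
For anyone who wants to rule out a leak on their own node, the standard Go route is a heap profile via net/http/pprof. This is a generic sketch, not how the backend necessarily wires it up; the localhost:6060 port is an assumption.

```go
// pprofserver.go - generic example of exposing Go's built-in profiler.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// Serve the profiler on a local-only port.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application runs here ...
	select {}
}
```

Heap snapshots taken a few hours apart with go tool pprof http://localhost:6060/debug/pprof/heap show whether in-use Go heap is actually growing; a flat heap with rising RSS points at off-heap usage (e.g. Badger's memory-mapped files) rather than a leak in the Go code.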

Makes sense -- thanks for the reply @maebeam

Glad to see Badger go, for a number of reasons hahaha