status-im/infra-nim-waku

Memory issues on `ac-cn-hongkong-c.wakuv2.prod` host

jakubgs opened this issue

On 2021-10-08, starting around 07:20 UTC, the `node-01.ac-cn-hongkong-c.wakuv2.prod` host started having memory issues:

[screenshot: memory usage graph]

There were also a few major CPU usage spikes:

[screenshot: CPU usage graph]

It appears this coincides with a major traffic spike:

[screenshot: network traffic graph]

Which caused a spike in orphaned sockets:

[screenshot: orphaned sockets graph]

This did not subside until I restarted the host.
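
For future reference, the orphaned-socket situation can also be checked directly on the host, without Grafana; a minimal sketch using standard Linux tooling (nothing waku-specific):

```sh
# Kernel socket accounting; the "orphan" field is presumably what the
# node-exporter graph above is derived from.
cat /proc/net/sockstat

# Summary from iproute2; the TCP line includes an "orphaned" count.
ss -s

# Ceiling after which the kernel starts dropping orphaned sockets.
cat /proc/sys/net/ipv4/tcp_max_orphans
```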

We can see a spike in logs around that time:

[screenshot: log volume graph]

It was mainly the nim-waku container that grew in usage, but not that much:

[screenshot: per-container log volume graph]

Actually, we don't detect the log level for nim-waku logs, and that graph also included websockify logs.

This is more like it:

[screenshot: nim-waku log volume graph]

https://kibana.infra.status.im/goto/f3881123aa4d789da02995909c9e2b10

Seems like most "errors" (though their level is WRN...) are one of these two:

failed to store messages    topics="wakustore" tid=1 file=waku_store.nim:456 err="failed to prepare"
failed to store peers       topics="wakupeers" tid=1 file=peer_manager.nim:44 err="failed to encode: Failed to encode public key"
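
For a rough breakdown of how often each of these warnings fires, outside of Kibana, something like the following could be run on the host; the container name `nim-waku` is an assumption, adjust to the actual deployment:

```sh
# Count occurrences of each warning in the last 24h of container logs.
docker logs --since 24h nim-waku 2>&1 | grep -c 'failed to store messages'
docker logs --since 24h nim-waku 2>&1 | grep -c 'failed to store peers'
```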

But the spike doesn't exist anymore:

[screenshot: log histogram with no spike visible]

We can also see a sprinkling of:

`metrics error: Unable to send response`

[screenshot: Kibana search results]
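
These look like the node's metrics HTTP server failing to write a response to a scrape. A quick way to check whether the endpoint itself is responsive (port 8008 is assumed here as the nim-waku default; adjust to the fleet's actual config):

```sh
# Probe the Prometheus metrics endpoint; fails fast if it hangs.
curl -sf --max-time 5 http://localhost:8008/metrics | head -n 20
```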

Thanks for this, @jakubgs. I went through some logs/graphs and believe our main issue is the poor performance and unbounded memory usage of the store, as logged in waku-org/nwaku#702.

I imagine that as the store grows and available memory falls, more and more CPU cycles will be spent on garbage collection, memory swapping, etc. This probably also explains why some users have complained that the prod fleet is slow.
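
One way to sanity-check this on the host would be to compare the container's resident memory against the on-disk size of the message store; a rough sketch, where both the container name and the DB path are assumptions:

```sh
# Resident memory and CPU of the nim-waku container (name is an assumption).
docker stats --no-stream nim-waku

# On-disk size of the SQLite message store; the path is a guess based on a
# typical data-dir layout, adjust to where the store DB actually lives.
docker exec nim-waku du -sh /data/store.sqlite3
```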

In my view, this is the highest-priority stability issue in nim-waku.