status-im/infra-nim-waku

Memory issues on `ac-cn-hongkong-c.wakuv2.prod` host

jakubgs opened this issue

On 2021-10-08, starting around 07:20 UTC, the `node-01.ac-cn-hongkong-c.wakuv2.prod` host started having memory issues:

[screenshot: memory usage graph]

There were also a few major CPU usage spikes:

[screenshot: CPU usage graph]

It appears this coincides with a major traffic spike:

[screenshot: network traffic graph]

Which caused a spike in orphaned sockets:

[screenshot: orphaned sockets graph]

This did not subside until I restarted the host.
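
For future reference, the orphaned-socket situation can also be checked directly on the host, without Grafana; a minimal sketch using standard Linux tooling (nothing waku-specific):

```sh
# Kernel socket accounting; the "orphan" field is presumably what the
# node-exporter graph above is derived from.
cat /proc/net/sockstat

# Summary from iproute2; the TCP line includes an "orphaned" count.
ss -s

# Ceiling after which the kernel starts dropping orphaned sockets.
cat /proc/sys/net/ipv4/tcp_max_orphans
```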

We can see a spike in logs around that time:

[screenshot: log volume graph]

It was mainly the nim-waku container that grew in usage, but not that much:

[screenshot: per-container log volume graph]

Actually, we don't detect the log level for nim-waku logs, and that graph also included websockify logs.

This is more like it:

[screenshot: nim-waku log volume graph]

https://kibana.infra.status.im/goto/f3881123aa4d789da02995909c9e2b10

Seems like most "errors" (though their level is WRN...) are one of these two:

failed to store messages    topics="wakustore" tid=1 file=waku_store.nim:456 err="failed to prepare"
failed to store peers       topics="wakupeers" tid=1 file=peer_manager.nim:44 err="failed to encode: Failed to encode public key"
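
For a rough breakdown of how often each of these warnings fires, outside of Kibana, something like the following could be run on the host; the container name `nim-waku` is an assumption, adjust to the actual deployment:

```sh
# Count occurrences of each warning in the last 24h of container logs.
docker logs --since 24h nim-waku 2>&1 | grep -c 'failed to store messages'
docker logs --since 24h nim-waku 2>&1 | grep -c 'failed to store peers'
```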

But the spike doesn't exist anymore:

[screenshot: log histogram with no spike visible]

We can also see a sprinkling of:

`metrics error: Unable to send response`

[screenshot: Kibana search results]
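
These look like the node's metrics HTTP server failing to write a response to a scrape. A quick way to check whether the endpoint itself is responsive (port 8008 is assumed here as the nim-waku default; adjust to the fleet's actual config):

```sh
# Probe the Prometheus metrics endpoint; fails fast if it hangs.
curl -sf --max-time 5 http://localhost:8008/metrics | head -n 20
```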

Thanks for this, @jakubgs. I went through some logs/graphs and believe our main issue is the poor performance and unbounded memory usage of the store, as logged in waku-org/nwaku#702.

I imagine that as the store grows and available memory falls, more and more CPU cycles will be spent on garbage collection, memory swapping, etc. This probably also explains why some users have complained that the prod fleet is slow.
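
One way to sanity-check this on the host would be to compare the container's resident memory against the on-disk size of the message store; a rough sketch, where both the container name and the DB path are assumptions:

```sh
# Resident memory and CPU of the nim-waku container (name is an assumption).
docker stats --no-stream nim-waku

# On-disk size of the SQLite message store; the path is a guess based on a
# typical data-dir layout, adjust to where the store DB actually lives.
docker exec nim-waku du -sh /data/store.sqlite3
```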

In my view, this is the highest-priority stability issue in nim-waku.