bug: node getting stuck and missing messages

Question

Opened this issue 2 months ago · 2 comments

Problem

When running simulations, @AlbertoSoutullo found that very often there's one node that misses lots of messages.
Looking at the logs, it seems that the node gets stuck for approximately 50 seconds.

Here's the moment where it happens:

TRC 2024-07-19 11:17:17.632+00:00 waiting for data                           topics="libp2p pubsubpeer" tid=7 file=pubsubpeer.nim:196 conn=16U*zXAqnw:669a4b1af74509e547f73df3 peer=16U*zXAqnw closed=false
TRC 2024-07-19 11:18:07.995+00:00 running heartbeat                          topics="libp2p gossipsub" tid=7 file=behavior.nim:775 instance=140313549693008

Across runs, the stall consistently lasts 50 seconds. It also coincides with the moment when the node establishes a large number of connections.
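The gap between the two log lines above can be checked with a short script. This is only a sketch for scanning a log file for similar stalls: the helper names (`parse_ts`, `find_stalls`) and the 10-second threshold are assumptions made for illustration, not part of nwaku or nim-libp2p.

```python
import re
from datetime import datetime

# Timestamp that follows the three-letter log level, e.g.
# "TRC 2024-07-19 11:17:17.632+00:00 ..."
TS_RE = re.compile(
    r"^\w{3} (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2})"
)


def parse_ts(line):
    """Return the timestamp of a log line, or None if the line has none."""
    match = TS_RE.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f%z")


def find_stalls(lines, threshold_s=10.0):
    """Return (gap_seconds, previous_line, next_line) for every gap between
    consecutive timestamped lines that exceeds threshold_s (assumed value)."""
    stalls = []
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a timestamp
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap > threshold_s:
                stalls.append((gap, prev_line, line))
        prev_ts, prev_line = ts, line
    return stalls


# The two log lines quoted in this issue.
sample = [
    'TRC 2024-07-19 11:17:17.632+00:00 waiting for data                           topics="libp2p pubsubpeer" tid=7 file=pubsubpeer.nim:196 conn=16U*zXAqnw:669a4b1af74509e547f73df3 peer=16U*zXAqnw closed=false',
    'TRC 2024-07-19 11:18:07.995+00:00 running heartbeat                          topics="libp2p gossipsub" tid=7 file=behavior.nim:775 instance=140313549693008',
]

stalls = find_stalls(sample)
for gap, before, after in stalls:
    print(f"{gap:.3f} s with no log output")
```

For the two lines quoted here, the gap is 50.363 seconds. Note that the script only sees gaps between consecutive lines emitted by the same log file, so it would need the node's full trace log to be useful.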

Impact

Critical

Expected behavior

Nodes shouldn't get stuck and should receive all messages

Screenshots/logs

logs.zip

nwaku version/commit hash

branch: release/v0.31 commit b34008e. Also reproduced in v0.30.1

Answer 1 · 2024-07-19T14:10:33.000Z

It seems to be related to the node establishing lots of connections in a short timespan.

I created an image whose only change is a workaround that allows a maximum of 20 connections in each connectivity loop iteration, and the issue no longer reproduces.

(branch debug-extra-nim-libp2p-logs-over-v0.31.0-with-limited-connections)
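As a rough illustration of the idea (not the actual nwaku change, which lives in the branch above and is written in Nim), capping the number of new connections per connectivity loop iteration could look like the sketch below. All names (`dial`, `connectivity_iteration`, `run_loop`) are hypothetical, and `dial` is a stand-in that does no real networking; only the limit of 20 comes from this comment.

```python
import asyncio

MAX_CONNECTIONS_PER_ITERATION = 20  # limit used in the workaround


async def dial(peer, connected):
    """Hypothetical stand-in for a real dial: yields once, then records the peer."""
    await asyncio.sleep(0)
    connected.add(peer)


async def connectivity_iteration(candidates, connected,
                                 limit=MAX_CONNECTIONS_PER_ITERATION):
    """Dial at most `limit` not-yet-connected peers; return how many were dialed."""
    batch = [p for p in candidates if p not in connected][:limit]
    await asyncio.gather(*(dial(p, connected) for p in batch))
    return len(batch)


async def run_loop(candidates, limit=MAX_CONNECTIONS_PER_ITERATION):
    """Repeat iterations until every candidate is connected.
    Returns the number of dials performed in each iteration."""
    connected = set()
    dials_per_iteration = []
    while True:
        dialed = await connectivity_iteration(candidates, connected, limit)
        if dialed == 0:
            break
        dials_per_iteration.append(dialed)
        await asyncio.sleep(0)  # placeholder for the real loop interval
    return dials_per_iteration


# 99 other peers, as one node would see in a 100-node mesh.
peers = [f"peer-{i}" for i in range(99)]
batches = asyncio.run(run_loop(peers))
print(batches)
```

With 99 candidate peers, the connections are spread over five iterations instead of being opened in a single burst.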

Answer 2 · 2024-07-19T14:13:41.000Z

Thanks for creating the issue! c:

To add a bit more information: we are confident that this issue is not related to the lab.
The information that we have right now is:

In the simulations, we injected 60 messages in 1 minute. The mesh is formed by 100 nwaku nodes.
Across all messages, there is one peer that misses ~75% of them.
Analyzing the logs, we confirmed that multiple nodes did send the messages to the problematic node.
After the fix mentioned in the previous comment, this no longer happens (tested more than 10 times; before, it was happening in 50% of the tests).