bug: node getting stuck and missing messages
Problem
When running simulations, @AlbertoSoutullo found that very often there's one node that misses lots of messages.
Looking at the logs, it seems that the node gets stuck for approximately 50 seconds.
Here's the moment where it happens:
TRC 2024-07-19 11:17:17.632+00:00 waiting for data topics="libp2p pubsubpeer" tid=7 file=pubsubpeer.nim:196 conn=16U*zXAqnw:669a4b1af74509e547f73df3 peer=16U*zXAqnw closed=false
TRC 2024-07-19 11:18:07.995+00:00 running heartbeat topics="libp2p gossipsub" tid=7 file=behavior.nim:775 instance=140313549693008
Across runs, the stall consistently lasts 50 seconds. It also happens at a moment when the node is establishing lots of connections.
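To spot this kind of stall in other runs, a small script can scan a node's log for large gaps between consecutive lines. This is only a sketch: it assumes every line starts with a level, a date and a timestamp in the format shown in the excerpt above, and the `find_stalls` name and the 10-second threshold are illustrative choices, not part of nwaku or nim-libp2p.

```python
from datetime import datetime

# Assumed line layout, based on the excerpt above:
#   "TRC 2024-07-19 11:17:17.632+00:00 <message> key=value ..."
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f%z"


def parse_timestamp(line):
    """Return the timestamp of a log line, or None if the line does not match."""
    parts = line.split(" ", 3)
    if len(parts) < 3:
        return None
    try:
        return datetime.strptime(f"{parts[1]} {parts[2]}", TS_FORMAT)
    except ValueError:
        return None


def find_stalls(lines, threshold_s=10.0):
    """Return (gap_seconds, previous_line, next_line) for every gap >= threshold_s."""
    stalls = []
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is None:
            continue  # skip lines that do not start with a timestamp
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap >= threshold_s:
                stalls.append((gap, prev_line, line))
        prev_ts, prev_line = ts, line
    return stalls
```

On the two lines quoted above, this reports a single gap of 50.363 seconds.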
Impact
Critical
Expected behavior
Nodes shouldn't get stuck and should receive all messages
Screenshots/logs
nwaku version/commit hash
branch: release/v0.31, commit b34008e. Also reproduced in v0.30.1.
It seems to be related to the node establishing lots of connections in a short timespan.
Created an image whose only change is a workaround that allows a maximum of 20 connections in each connectivity loop iteration, and the issue stopped reproducing (branch debug-extra-nim-libp2p-logs-over-v0.31.0-with-limited-connections).
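For readers who want the idea of the workaround without reading the branch, here is a rough sketch of capping dials per loop iteration. It is not the nwaku code (which is written in Nim): `connectivity_iteration`, `dial` and `candidates` are hypothetical names, and only the limit of 20 comes from the comment above.

```python
import asyncio

# Limit taken from the workaround described above; everything else is illustrative.
MAX_DIALS_PER_ITERATION = 20


async def connectivity_iteration(candidates, dial, max_dials=MAX_DIALS_PER_ITERATION):
    """Dial at most `max_dials` candidate peers in one loop iteration.

    `dial` is a hypothetical coroutine function taking a peer; peers beyond
    the cap are left for a later iteration. Returns the peers that connected.
    """
    batch = list(candidates)[:max_dials]
    results = await asyncio.gather(
        *(dial(peer) for peer in batch), return_exceptions=True
    )
    # A failed dial shows up as an exception object in `results`.
    return [
        peer
        for peer, result in zip(batch, results)
        if not isinstance(result, BaseException)
    ]
```

The point of the cap is only to spread connection establishment over several iterations instead of starting all dials at once.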
Thanks for creating the issue! c:
To add a bit more information: we are confident that this issue is not related to the lab.
The information that we have right now is:
- In the simulations, we injected 60 messages in 1 minute. The mesh is formed by 100 nwaku nodes.
- Across all messages, there is one peer that misses ~75% of them.
- Analyzing the logs, we double-checked that there were multiple nodes that actually sent the message to the problematic node.
- This no longer happens after the fix mentioned in the previous comment (tested more than 10 times; before the fix it happened in 50% of the tests).
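The per-node miss rate mentioned above can be computed along these lines once the received message IDs have been extracted from each node's log. This is a sketch with assumed inputs: `received_by_node` (node name to received message IDs) and `injected_ids` are hypothetical structures, not the format our tooling actually produces.

```python
def missed_fraction(received_by_node, injected_ids):
    """Map each node to the fraction of injected messages it never received."""
    injected = set(injected_ids)
    if not injected:
        raise ValueError("no injected messages to compare against")
    return {
        node: 1 - len(injected & set(received)) / len(injected)
        for node, received in received_by_node.items()
    }


def worst_node(received_by_node, injected_ids):
    """Return (node, missed_fraction) for the node that missed the most messages."""
    fractions = missed_fraction(received_by_node, injected_ids)
    node = max(fractions, key=fractions.get)
    return node, fractions[node]
```

With 60 injected messages, a node that received only 15 of them comes out at a missed fraction of 0.75.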