Tribler/py-ipv8

LAN discovery blocks asyncio thread

egbertbouman opened this issue · 5 comments

Since IPv8 2.11, on_introduction_request / on_new_introduction_request intermittently takes 0.85s on my Windows 10 machine, blocking the asyncio thread. It turns out that the new LAN address discovery is causing the problems.

This code is replacing netifaces, which used to block for 1.577 seconds on some machines. With 0.85 seconds now, at least it's getting better... I guess?

Jokes aside, it is especially important to wait for these results before sending out the first intro messages to the bootstrappers. Many of our experiments depend on local peers quickly realising that they are on each other's LAN (waiting for the next results after 10 seconds is unacceptable). All following calls can be dumped on some thread though. This is not trivial to implement.
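To make the "first call blocks, later calls go to a thread" idea concrete, here is a minimal sketch. It is not the IPv8 implementation: the class name `CachedLanAddresses`, the `lookup` callable and the `interval` parameter are all hypothetical, and `get()` is assumed to be called from inside a running event loop.

```python
import asyncio
import time


class CachedLanAddresses:
    """Hypothetical cache: blocking first lookup, background refreshes afterwards."""

    def __init__(self, lookup, interval=10.0):
        self.lookup = lookup            # blocking callable returning a list of addresses
        self.interval = interval        # seconds before a cached result counts as stale
        self.addresses = lookup()       # the first call blocks on purpose
        self.last_update = time.monotonic()
        self._task = None               # keep a reference so the refresh task is not lost

    async def _refresh(self):
        loop = asyncio.get_running_loop()
        # Run the blocking lookup on the default thread pool, off the event loop.
        self.addresses = await loop.run_in_executor(None, self.lookup)
        self.last_update = time.monotonic()

    def get(self):
        """Return the cached addresses at once; schedule a refresh if they are stale."""
        stale = time.monotonic() - self.last_update >= self.interval
        if stale and (self._task is None or self._task.done()):
            self._task = asyncio.get_running_loop().create_task(self._refresh())
        return self.addresses


async def demo():
    # Stand-in for a slow OS lookup: returns a different address on each call.
    results = iter([["192.168.1.2"], ["192.168.1.3"], ["192.168.1.4"]])
    cache = CachedLanAddresses(lambda: next(results), interval=0.0)
    first = cache.get()                 # served from the blocking first lookup
    await asyncio.sleep(0.1)            # give the background refresh time to finish
    return first, cache.get()
```

The trade-off is that callers may see a result that is one refresh behind, which seems acceptable here as long as the very first result is fresh.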

Administration: I'll assign this issue "medium" priority for three reasons. Firstly, Tribler still ships with netifaces and this code is, therefore, not run in production (yet). Secondly, the blocking code is only executed every 10 seconds. Lastly, this does not crash the IPv8 application. However, this is definitely not "low" priority as blocking the reactor thread and, therefore, the socket for even a second can lead to major packet loss. This should be fixed before the next IPv8 release.

With netifaces (IPv8 2.10) I'm seeing no delays on the asyncio thread. So at least on my machine, the newer code is significantly worse.

The delay of 1.577s you're referring to is for 580 calls, which I would consider to be pretty fast. I'm currently seeing delays of 0.85s for a single call.

I guess the problem with all of the calls in messaging.interfaces.lan_addresses is that each and every one of them may block on some machines.
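One way to check which callback stalls the loop on a given machine is asyncio's debug mode, which logs a warning on the "asyncio" logger for every callback that runs longer than `loop.slow_callback_duration`. The helper below is only an illustrative sketch (the names `report_slow_callbacks` and `_ListHandler` are made up for this example); it runs a single callback and collects those warnings.

```python
import asyncio
import logging
import time


class _ListHandler(logging.Handler):
    """Collects formatted log messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def report_slow_callbacks(callback, threshold=0.1):
    """Run `callback` on a debug-mode loop; return asyncio's slow-callback warnings."""
    handler = _ListHandler()
    logger = logging.getLogger("asyncio")
    logger.addHandler(handler)

    async def main():
        loop = asyncio.get_running_loop()
        loop.slow_callback_duration = threshold  # seconds; asyncio's default is 0.1
        loop.call_soon(callback)
        await asyncio.sleep(0)                   # yield so the callback gets to run

    try:
        asyncio.run(main(), debug=True)
    finally:
        logger.removeHandler(handler)
    # Slow callbacks are reported as "Executing <Handle ...> took N seconds".
    return [message for message in handler.messages if "took" in message]


if __name__ == "__main__":
    # time.sleep stands in for a blocking network call.
    for line in report_slow_callbacks(lambda: time.sleep(0.2)):
        print(line)
```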

Ah, yes, you're right: the 1.577s was cumulative, my bad. The actual problem seems to be worse with netifaces, where it is described as "intermittently exhibits slow performance, sometimes taking several seconds to execute a single call, blocking the asyncio loop" (see same linked post). However, again, on one particular machine.

From what I understand the intermittent slow performance of DiscoveryBooster.take_step also has to do with trying to do too many things without yielding the event loop (i.e. calling ifaddresses 580 times in a row), and that's what's causing the event loop to block. That sounds like a different issue to the one I'm describing.

> trying to do too many things without yielding the event loop

That shouldn't be the case (unless there is some glorious bug inside of the asyncio socket wrapper). These should be individual asyncio calls (from individually scheduled message handlers). That would mean that the underlying issue is the same for all of this: some network calls can block for unknown amounts of time on some computers. For your machine it is apparently socket.getfqdn(), for the other issue netifaces.ifaddresses() and for the 5 machines I tested with, none of this code blocked ("works on my machines").

With that, I think the reasonable solution is to somehow (open to interpretation for whoever wants to fix this) time these calls out if they take too long.
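One possible reading of "time these calls out", sketched under assumptions: run the blocking call on the default thread pool and stop waiting after a deadline. `call_with_timeout`, its `timeout` default and the `default` fallback are hypothetical names and values, not an existing IPv8 API. Note that the worker thread itself cannot be interrupted, so this only frees the event loop and does not abort the call.

```python
import asyncio
import time


async def call_with_timeout(func, *args, timeout=0.1, default=None):
    """Run blocking `func(*args)` in the default executor; return `default` on timeout."""
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(None, func, *args)
    try:
        return await asyncio.wait_for(future, timeout)
    except asyncio.TimeoutError:
        # The worker thread is NOT interrupted: it finishes in the background
        # and its result is discarded.
        return default


async def demo():
    # time.sleep stands in for a lookup such as socket.getfqdn() that stalls.
    slow = await call_with_timeout(time.sleep, 0.3, timeout=0.05, default="timed out")
    fast = await call_with_timeout(len, "abc", timeout=1.0)
    return slow, fast
```

Since this makes the lookup a coroutine, any handler that currently calls it synchronously would have to await it or fall back to a cached value.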