libp2p/js-libp2p

`.stop()` sometimes hangs indefinitely

marcus-pousette opened this issue · 8 comments

  • Version:
    1.8.1

  • Subsystem:
    @libp2p/websocket , connection manager, components

Severity:

High

Description:

Steps to reproduce the error:

I have a test scenario with a few nodes that I connect to each other and send some data

Sometimes when terminating a node it endlessly waits for the Websocket transport to shutdown. This is because there are connections that does close (?) and in the it-ws library there are no calls to server.closeAllConnections() before close.

I have managed to make the shutdown work better if modify

private async _invokeStartableMethod (methodName: 'beforeStart' | 'start' | 'afterStart' | 'beforeStop' | 'stop' | 'afterStop'): Promise<void> {
to terminate the components in the reverse order in sequentially... so I suspect there is some kind of race condition going on or leak of some sort

Also no issues if I use Tcp transport instead.

This is not a good description on how to reproduce the issue, because it is a bit random when it occurs. But wanted to create this issue if others have the same problem and will update the description if I have a isolated a good test

When the stop timeouts, I have one connection left with 0 streams,
remoteAddr: '/ip6/::ffff:7f00:1/tcp/56327/p2p/12D3KooWApN6mEB5666wTpqpzkgcKXC7kgMxmm7CAWRiD3WAg4Yv'
direction: 'inbound'
encryption: '/noise'
multiplexer: '/yamux/1.0.0'
transient: false

Closing the connectionManager component before all other components seems also to fix the issue

async _invokeStartableMethod(methodName) {
        const timeout = new Promise((resolve, reject) => {
            setTimeout(() => {
                reject('timeout')
            }, 10000);
        });
        const connectionManager = this.components["connectionManager"]
        if (isStartable(connectionManager)) {
            await connectionManager[methodName]?.();
        }
        let promise = Promise.all(Object.values(this.components)
            .filter(obj => isStartable(obj))
            .map(async (startable) => {
                await startable[methodName]?.();
            }));
        const race = Promise.race([promise, timeout]);
        try {
            await race;
        }
        catch (e) {
            throw e;
        }
        finally {
            clearTimeout(timeout);
        }
    }

in the it-ws library there are no calls to server.closeAllConnections() before close.

I may be missing something but where is the closeAllConnections method defined? The it-ws module stores a reference to a http/https net.Server which has no such method.

no issues if I use Tcp transport

Looking at the code, the TCP listener aborts each connection on close whereas the WebSocket transport tries to close them gracefully.

Do you still see the issue if the transport aborts the connections instead of closing them on close?

http.Server and https.Server has that method since NodeJs 18.2.0

/// it-ws/server.ts

 async close (): Promise<void> {
    await new Promise<void>((resolve, reject) => {
      this.server.closeAllConnections()
      this.server.close((err) => {
        if (err != null) {
          reject(err); return
        }

        resolve()
      })
    })
  }
  

Actually the closeAllConnections trick does work since the connection is a tcp one (?)

The close connection manager first trick seems to always work.

Also adding like a 5s delay before stopping also seems to work.

It feels like there is some kind of connection creation in process that is not caught correctly on closing

http.Server and https.Server has that method since NodeJs 18.2.0

Aha, that's interesting, nice find.

The docs do say though:

This does not destroy sockets upgraded to a different protocol, such as WebSocket or HTTP/2

...so I wonder if there's something else going on here?

Ye the whole thing about aborting Websocket connections before closing does not seem to have an impact. Only the two things I mentioned in my last comment did mitigate the problem for me

Looking at the code, the TCP listener aborts each connection on close whereas the WebSocket transport tries to close them gracefully.

That would explain something!