grpc/grpc-swift

Timed out waiting for an HTTP/2 stream from the connection pool, cause: ioOnClosedChannel

sergiymomot opened this issue · 11 comments

Describe the bug

We have reports from users that after the app has been kept in the background for some time, all network requests fail once it is brought back into the foreground, and they keep failing until the app is re-launched. Internally, one of our employees hit the same issue, and in our logs we see the error deadline exceeded (4): Timed out waiting for an HTTP/2 stream from the connection pool, cause: ioOnClosedChannel.

We use GRPCChannelPool with the following initialization:

private let grpcEventLoopGroup = PlatformSupport.makeEventLoopGroup(loopCount: 1)
...
return try GRPCChannelPool.with(
    target: .host(endpoint.url.absoluteString),
    transportSecurity: .tls(.makeClientDefault(compatibleWith: grpcEventLoopGroup)),
    eventLoopGroup: grpcEventLoopGroup, {
        $0.idleTimeout = TimeAmount.minutes(5)
        $0.connectionPool.connectionsPerEventLoop = 16
        $0.connectionPool.maxWaitTime = .seconds(10)
    }
)

swift-grpc version: 1.18.0
platform: iOS (all versions)

We are not sure whether this is a bug, intended behavior, or a misuse of the library on our part, so any help is highly appreciated!

To reproduce

This behavior is not easily reproducible.

Expected behavior

Users should be able to make network requests after bringing the app into the foreground after a long background stay.

Thanks for filing this. The current theory is that when the app is backgrounded the file descriptor is invalidated, but gRPC doesn't know this when the app comes back to the foreground. Any writes which then happen on the connection would cause a connection error (this is likely the cause: ioOnClosedChannel in the error).

What isn't clear at the moment is why the connection pool hasn't thrown that connection away and replaced it with a new one. It's surprising that all network requests fail until the app has been re-launched, which suggests that gRPC gets into a bad state it can't recover from. Do you have any gRPC logs surrounding these events?

@glbrntt unfortunately, we don't have any logs. As I mentioned, this is hard to reproduce since it doesn't happen after every background period of the app, only after longer ones. But we will keep trying to reproduce it.

In terms of a possible workaround for now, would it make sense to listen for this error and, when it happens, shut down the existing connection pool and create a new one? And if so, is there a way to monitor for this error other than making a network request and having it fail with this error, or a way to check the state of the connection pool to determine whether it is in an unrecoverable state (e.g. channel closed)?

> @glbrntt unfortunately, we don't have any logs. As I mentioned, this is hard to reproduce since it doesn't happen after every background period of the app, only after longer ones. But we will keep trying to reproduce it.

That'd be great if you could get logs.

> In terms of a possible workaround for now, would it make sense to listen for this error and, when it happens, shut down the existing connection pool and create a new one? And if so, is there a way to monitor for this error other than making a network request and having it fail with this error, or a way to check the state of the connection pool to determine whether it is in an unrecoverable state (e.g. channel closed)?

Your best bet is likely to be closing the connection pool when the app is backgrounded and recreating it when it comes to the foreground.
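
For illustration, a rough sketch of that approach (not tested; the class name, endpoint, loop count, and configuration values below are placeholders based on the snippet in the issue):

import GRPC
import NIOCore
import UIKit

// Sketch only: tear the pool down when the app is backgrounded and build a
// fresh one when it returns to the foreground. The endpoint and configuration
// values are placeholders mirroring the snippet in the issue.
final class PooledChannelProvider {
    private let group = PlatformSupport.makeEventLoopGroup(loopCount: 1)
    private var observers: [NSObjectProtocol] = []
    private(set) var channel: GRPCChannel?

    init() {
        channel = try? makeChannel()

        observers.append(NotificationCenter.default.addObserver(
            forName: UIApplication.didEnterBackgroundNotification,
            object: nil,
            queue: .main
        ) { [weak self] _ in
            // Close the pooled channel; its connections may be invalidated
            // while the app is in the background anyway.
            _ = self?.channel?.close()
            self?.channel = nil
        })

        observers.append(NotificationCenter.default.addObserver(
            forName: UIApplication.willEnterForegroundNotification,
            object: nil,
            queue: .main
        ) { [weak self] _ in
            // Recreate the pool so the first foreground RPC gets a fresh connection.
            self?.channel = try? self?.makeChannel()
        })
    }

    private func makeChannel() throws -> GRPCChannel {
        try GRPCChannelPool.with(
            target: .host("example.com", port: 443),  // placeholder endpoint
            transportSecurity: .tls(.makeClientDefault(compatibleWith: group)),
            eventLoopGroup: group
        ) {
            $0.idleTimeout = .minutes(5)
            $0.connectionPool.maxWaitTime = .seconds(10)
        }
    }
}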

This is relevant to me as well. I have to wait 20 seconds for my channel to update its status. This is true for GRPCChannelPool as well as for ClientConnection. Now I have two options:

  1. Disconnect the channel and reconnect it.
  2. Reduce the keepalive interval to 1-5 seconds so that the delay is less noticeable to the user (a sketch follows below).
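
For option 2, a minimal sketch of what a shorter keepalive might look like (the exact values here are only an illustration):

import GRPC
import NIOCore

// Option 2 sketch: ping frequently so a dead connection is noticed within
// seconds rather than minutes. The values are illustrative only.
let shortKeepalive = ClientConnectionKeepalive(
    interval: .seconds(5),  // send a keepalive ping every 5 seconds
    timeout: .seconds(3)    // treat the connection as dead if no ack arrives within 3 seconds
)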

This seems like a bug: when a request is sent to the server after the internet connection has been lost, the channel does not update its state for a long time. Ideally, sending a request should check whether the connection between the channel and the server is still alive.

Expected state

  1. Keepalive of 5 minutes (custom).
  2. When a request is sent to the server, the client checks whether the connection between the channel and the server is still alive.

@glbrntt Any updates?

Can you provide more context? Does this happen when the app has been in the background?

@glbrntt

It happens most often when I go into the background. After I return to the foreground, the channel does not update its state and I have to wait more than 15 seconds for the client to reconnect.

E.g.

func makeChannelConfiguration() -> Configuration {
  let gRPCTLSConfiguration: GRPCTLSConfiguration = .makeClientDefault(for: .best)
  let eventLoopGroup: EventLoopGroup = PlatformSupport.makeEventLoopGroup(
      compatibleWith: gRPCTLSConfiguration,
      loopCount: 1
  )
  let keepalive = ClientConnectionKeepalive(
      interval: .minutes(5),
      timeout: .seconds(5)
  )
  // ...
}

func makeConnectionChannel() -> ClientConnection {
  ClientConnection
      .usingTLS(
          with: configuration.tlsConfiguration,
          on: configuration.eventLoopGroup
      )
      .withKeepalive(configuration.keepalive)
      .withConnectivityStateDelegate(
          WeakChannelProvider(self),
          executingOn: DispatchQueue.api
      )
      .withConnectionTimeout(minimum: .milliseconds(500))
      .withConnectionBackoff(initial: .milliseconds(100))
      .withConnectionBackoff(maximum: .seconds(5))
      .withConnectionBackoff(multiplier: 1)
      // connect(host:port:) turns the builder into a ClientConnection;
      // host and port come from the app's own configuration (not shown above)
      .connect(host: configuration.host, port: configuration.port)
}

Also, if I close the channel when the app goes into the background (to avoid waiting for the channel state to update), the app sometimes hangs until the channel reconnects to the network.

> It happens most often when I go into the background. After I return to the foreground, the channel does not update its state and I have to wait more than 15 seconds for the client to reconnect.

Do you have any logs for this? It's hard to say whether this is a bug or not without any more information.

> Also, if I close the channel when the app goes into the background (to avoid waiting for the channel state to update), the app sometimes hangs until the channel reconnects to the network.

The app hanging won't be caused by gRPC connecting, since connecting, making RPCs, etc. is all asynchronous.

@glbrntt

What kind of logs do you need exactly? Basically what I see in the logs is the following:

> Returning to the foreground after background mode
> // Nothing happens
> I made a request to the server
> // Nothing happens for 15 seconds
> The gRPC channel status changed to error and my requests failed.
> The channel connected and the status changed to successful.

(If I set the keepalive interval to 1 second, reconnection happens immediately.)

It seems that on Android the channel is checked when an attempt is made to send a request to the server, so the channel state is updated quickly when there is a problem, whereas the iOS client seems to rely only on the keepalive configuration.

@glbrntt

> What kind of logs do you need exactly?

Logs from grpc-swift: you can configure a background activity logger (set on the connection) and a per-request logger (set in the call options). Ideally we'd have both at trace level.
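
For example, something along these lines (the logger labels and the host/port below are placeholders):

import GRPC
import Logging

// Sketch only: wire up both loggers at trace level.
let group = PlatformSupport.makeEventLoopGroup(loopCount: 1)

var backgroundLogger = Logger(label: "grpc.background")
backgroundLogger.logLevel = .trace

var requestLogger = Logger(label: "grpc.requests")
requestLogger.logLevel = .trace

// The background activity logger is set on the connection builder...
let connection = ClientConnection
    .usingTLS(with: .makeClientDefault(compatibleWith: group), on: group)
    .withBackgroundActivityLogger(backgroundLogger)
    .connect(host: "example.com", port: 443)  // placeholder endpoint

// ...and the per-request logger is passed via CallOptions on each RPC.
let callOptions = CallOptions(logger: requestLogger)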

> Basically what I see in the logs is the following:

> Returning to the foreground after background mode
> // Nothing happens
> I made a request to the server
> // Nothing happens for 15 seconds
> The gRPC channel status changed to error and my requests failed.
> The channel connected and the status changed to successful.

> (If I set the keepalive interval to 1 second, reconnection happens immediately.)

> It seems that on Android the channel is checked when an attempt is made to send a request to the server, so the channel state is updated quickly when there is a problem, whereas the iOS client seems to rely only on the keepalive configuration.

If the app was backgrounded we'll only know the actual state of the connection when we attempt to write. It's not obvious to me why there is a 15 second delay though, so logs, or an easy way to reproduce this, would be helpful for diagnosing it.

The delay can vary; I will try to get the logs for you.