Only 100 streaming calls get responses under very specific conditions
jskeet opened this issue · 17 comments
Original GitHub issue: googleapis/google-cloud-dotnet#12318
Diagnostic test solution: https://github.com/googleapis/google-cloud-dotnet/tree/main/issues/Issue12318
The most useful directory in there (in terms of looking at the code) is probably RawGrpcStubs.
Currently, I can reproduce this when talking to Firestore. I haven't managed to reproduce it with a local gRPC server - you can see attempts in the repro directory above, but they didn't come to anything.
The repro goes:
- Code: this is just the code generated by the vanilla C# protobuf generator and the gRPC plugin; it doesn't include the code we normally use for full Google client libraries ("GAPIC" code) or the handwritten Firestore code.
- Prework: Create a collection with 300 documents in it (the PopulateCollection project does this if you want to repro against Firestore yourself)
- With a single FirestoreClient:
  - Call ListDocuments (a unary RPC) to get all the documents
  - For each document, call Listen (a bidi streaming RPC), send a single request within that stream, and observe whether we see any responses
- Wait for 10 seconds to give everything a chance to stabilize, then report results
I would expect every listener to receive a few responses - so we'd report at the end "We have 300 documents, and 300 listeners received a response."
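For concreteness, here's a minimal sketch of that shape. Hedged heavily: `credentials`, `projectId`, `collectionId` and `CountResponsesAsync` are placeholders I've introduced here, and the real code (in the RawGrpcStubs project) differs in detail.

```csharp
using Google.Cloud.Firestore.V1;
using Grpc.Net.Client;

// Sketch only - placeholder names throughout; see RawGrpcStubs for the real code.
var channel = GrpcChannel.ForAddress("https://firestore.googleapis.com",
    new GrpcChannelOptions { Credentials = credentials });
var client = new Firestore.FirestoreClient(channel);

// Unary RPC: list the 300 documents (paging elided).
var list = await client.ListDocumentsAsync(new ListDocumentsRequest
{
    Parent = $"projects/{projectId}/databases/(default)/documents",
    CollectionId = collectionId
});

// Bidi streaming RPC: one Listen stream per document, each sending a single request.
var observers = new List<Task>();
foreach (var doc in list.Documents)
{
    var call = client.Listen();
    await call.RequestStream.WriteAsync(new ListenRequest
    {
        Database = $"projects/{projectId}/databases/(default)",
        AddTarget = new Target
        {
            Documents = new Target.Types.DocumentsTarget { Documents = { doc.Name } }
        }
    });
    observers.Add(CountResponsesAsync(call)); // records whether this listener sees any response
}

await Task.Delay(TimeSpan.FromSeconds(10));
// ... then cancel every call and report how many listeners saw a response.
```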
What we actually see is:
- On Linux (.NET 6 and 8; not tested on 7), expected behavior
- On Windows 10 (.NET 6, 7 and 8), expected behavior
- On Windows 11, .NET 6, expected behavior
- On Windows 11, .NET 7, 8 and 9-preview: only 100 of the 300 listeners receive a response
I can give more version details if that would be useful - let me know exactly what's required.
Some experiments:
- Using a different client (with a different channel) for the initial unary RPC: all 300 listeners work. It's as if the initial unary call taints the channel somehow, stopping it from handling all 300 listeners, only handling 100 at a time.
- Running the "listen" part in multiple iterations: we get 100 active clients per iteration
- Explicitly specifying the HttpHandler as a SocketsHttpHandler with various options: doesn't seem to make a difference (experimented with EnableMultipleHttp2Connections and MaxConnectionsPerServer; see the sketch after this list)
- Tried using nightly version 2.61.0-dev202404020702 (with Grpc.Auth 2.61.0) - still just 100 listeners with .NET 8
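Roughly what that handler experiment looked like (illustrative values; `credentials` is a placeholder):

```csharp
// Illustrative only: neither option made a visible difference.
var handler = new SocketsHttpHandler
{
    EnableMultipleHttp2Connections = true, // allow extra HTTP/2 connections once a connection hits its stream limit
    MaxConnectionsPerServer = 10
};
var channel = GrpcChannel.ForAddress("https://firestore.googleapis.com",
    new GrpcChannelOptions { HttpHandler = handler, Credentials = credentials });
```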
It's not clear to me what the next step is in terms of diagnosing this - any suggestions would be welcome.
Looking at the logging, it appears that all 300 listeners manage to send their request, but then 200 of the 300 never see a response. In an earlier test I thought I'd seen only 100 listeners make a request, with the rest blocked - but I suspect that was a diagnostics failure.
I don't remember seeing a difference between Windows 10 and Windows 11 before when it comes to running .NET (as opposed to .NET Framework) code - and it's rare to spot differences from .NET 6 to later versions like this.
Ping for visibility - are there any maintainers here with thoughts on this?
Has the client been configured to create multiple connections if required? 100 streams is a common limit per-connection. It is what Kestrel uses.
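For reference, that per-connection limit is configurable on the server side in Kestrel; this is server configuration rather than anything the client here controls, and `builder` is assumed to be a minimal ASP.NET Core `WebApplicationBuilder`:

```csharp
// Kestrel's HTTP/2 stream limit, shown with its default value of 100.
builder.WebHost.ConfigureKestrel(options =>
    options.Limits.Http2.MaxStreamsPerConnection = 100);
```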
Yes, as per my list of experiments, I've experimented with EnableMultipleHttp2Connections and MaxConnectionsPerServer, with no visible impact. If there's another setting you'd like me to try, I'm all ears. (I also don't know why this would be affected by making an initial unary call.)
Ok. I don't know Firestore and don't have an account for it. If you have an account I can use in a simple way, such as just running a console app with the given parameters, then I could test it out on my machine. Email me details if that's an option.
Alternatively, you can get more information on your machine. To get more information about the client, I suggest you set up client logging:
https://learn.microsoft.com/en-us/aspnet/core/grpc/diagnostics?view=aspnetcore-8.0#grpc-client-logging
The most likely place for the hang to happen is inside SocketsHttpHandler. The handler uses EventSource instead of Microsoft.Extensions.Logging for its logging, so also create a listener for it at the start of the program. This is what I use inside gRPC tests for viewing handler logging: https://github.com/grpc/grpc-dotnet/blob/master/test/Shared/HttpEventSourceListener.cs
The detailed handler logging will let you know whether request frames are successfully being sent to the server, or if they're queuing somewhere in the client.
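Roughly, the wiring for both log sources looks like this (a sketch: HttpEventSourceListener is the linked test helper copied into your app, its exact constructor may differ, and `credentials` is a placeholder):

```csharp
var loggerFactory = LoggerFactory.Create(logging =>
    logging.AddConsole().SetMinimumLevel(LogLevel.Debug));

// EventSource-based logging from SocketsHttpHandler (and QUIC).
using var handlerLogging = new HttpEventSourceListener(loggerFactory);

// Microsoft.Extensions.Logging-based logging from the gRPC client.
var channel = GrpcChannel.ForAddress("https://firestore.googleapis.com",
    new GrpcChannelOptions { LoggerFactory = loggerFactory, Credentials = credentials });
```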
Back to testing now.
Just in case it was relevant, I tried AppContext.SetSwitch("System.Net.SocketsHttpHandler.Http2FlowControl.DisableDynamicWindowSizing", true) - that didn't help.
Thanks for the logging details... I've now got a 41,000 line log file to look through :)
I'll do what I can myself, but if I get stuck I'll email you some suitable account details.
> Looking at the logging, it appears that all 300 listeners manage to send their request, but then 200 of the 300 never see a response. In an earlier test I thought I'd seen only 100 listeners make a request, with the rest blocked - but I suspect that was a diagnostics failure.
Ooh, looking at the logs again, I was wrong about this - or at least, somewhat wrong. Only ~100 manage to send their request to start with... the other 200 start getting to send their requests when the listen operations are being cancelled.
So the log for .NET 8 looks like this:
- 100x send request
- 100x "each listener receives 4 responses" (3x TargetChange and 1x DocumentChange)
- 10 seconds of no log entries
- Mixture of:
  - 100x MoveNext RPC cancelled (for the ones which managed to send requests earlier)
  - 200x send request followed by MoveNext RPC cancelled (for the ones which didn't manage to send their requests earlier)
The log for .NET 6 looks like this:
- 200x send request
- Lots of listener responses
- 100x send request (but this is only about 30ms after the end of the earlier send requests)
- Lots of listener responses
- 10 second gap
- 300x MoveNext RPC cancelled
- 300x "each listener receives 4 responses" (3x TargetChange and 1x DocumentChange)
- 300x MoveNext RPC cancelled
So I'm pretty sure it is the send that's blocking.
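In call-site terms (reusing the placeholder names from the earlier sketch, with `listenRequest` as the single request each listener sends), the hang is here:

```csharp
var call = client.Listen();
// On Windows 11 with .NET 7/8, this await never completes for listeners beyond the first ~100.
await call.RequestStream.WriteAsync(listenRequest);
```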
Okay, I've tweaked the test app a little:
- Start 100 listeners
- Wait 6 seconds (so plenty of time for them to send a request and get responses)
- Start 1 listener
- Wait 20 seconds before cancelling everything
This makes it easier to isolate the logs for the one failed listener.
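A sketch of that sequencing (`StartListenerAsync` is a placeholder for "open a Listen stream, send one request, log responses"):

```csharp
var cts = new CancellationTokenSource();
var listeners = new List<Task>();
for (int i = 0; i < 100; i++)
{
    listeners.Add(StartListenerAsync(client, cts.Token));
}
await Task.Delay(TimeSpan.FromSeconds(6)); // plenty of time for the first 100 to send and receive
listeners.Add(StartListenerAsync(client, cts.Token)); // the 101st: the one that hangs
await Task.Delay(TimeSpan.FromSeconds(20));
cts.Cancel();
```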
Here's an example of the logging I get - the send call never returns
// Broken send log
08:48:26.367Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Http - Info : PushStreamContent`2#376535,.ctor, ->
08:48:26.367Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Http - Associate : HttpRequestMessage#27487090,Content,HttpRequestMessage#27487090,PushStreamContent`2#376535 ->
08:48:26.367Z dbug: Grpc.Net.Client.Internal.GrpcCall[1]
Starting gRPC call. Method type: 'DuplexStreaming', URI: 'https://firestore.googleapis.com/google.firestore.v1.Firestore/Listen'.
08:48:26.367Z dbug: HttpEventSourceListener[0]
<- Event System.Net.Http - RequestStart : https,firestore.googleapis.com,443,/google.firestore.v1.Firestore/Listen,2,0,1 ->
08:48:26.367Z dbug: Grpc.Net.Client.Balancer.Internal.ConnectionManager[6]
Successfully picked subchannel id '1-1' with address firestore.googleapis.com:443. Transport status: ActiveStream
08:48:26.367Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Http - HandlerMessage : 14347911,0,0,GetHttp3ConnectionAsync,Using existing HTTP3 connection. ->
08:48:26.367Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : MsQuicContextSafeHandle#60400541,.ctor,[strm][0x1BD82CC0F40] MsQuicSafeHandle created ->
08:48:26.367Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : MsQuicContextSafeHandle#60400541,.ctor,[strm][0x1BD82CC0F40] [conn][0x1BD82AD6DE0] ref count incremented ->
08:48:26.368Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : QuicStream#238774,NativeCallback,[strm][0x1BD82CC0F40] Received event START_COMPLETE { Status = 0, ID = 400, PeerAccepted = 0 } ->
Whereas if I isolate a single successful listener, I believe the log for sending the request looks like this:
08:55:02.692Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Http - Info : PushStreamContent`2#63566392,.ctor, ->
08:55:02.692Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Http - Associate : HttpRequestMessage#24004906,Content,HttpRequestMessage#24004906,PushStreamContent`2#63566392 ->
08:55:02.692Z dbug: Grpc.Net.Client.Internal.GrpcCall[1]
Starting gRPC call. Method type: 'DuplexStreaming', URI: 'https://firestore.googleapis.com/google.firestore.v1.Firestore/Listen'.
08:55:02.692Z dbug: HttpEventSourceListener[0]
<- Event System.Net.Http - RequestStart : https,firestore.googleapis.com,443,/google.firestore.v1.Firestore/Listen,2,0,1 ->
08:55:02.692Z dbug: Grpc.Net.Client.Balancer.Internal.ConnectionManager[6]
Successfully picked subchannel id '1-1' with address firestore.googleapis.com:443. Transport status: ActiveStream
08:55:02.692Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Http - HandlerMessage : 35236192,0,0,GetHttp3ConnectionAsync,Using existing HTTP3 connection. ->
08:55:02.692Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : MsQuicContextSafeHandle#45725801,.ctor,[strm][0x2A202CDD110] MsQuicSafeHandle created ->
08:55:02.692Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : MsQuicContextSafeHandle#45725801,.ctor,[strm][0x2A202CDD110] [conn][0x2A202CD6DE0] ref count incremented ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : QuicStream#63504289,NativeCallback,[strm][0x2A202CDD110] Received event START_COMPLETE { Status = 0, ID = 4, PeerAccepted = 1 } ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event System.Net.Http - RequestLeftQueue : 0.2584,3,0 ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Http - HandlerMessage : 35236192,52392654,0,SendAsync,Sending request: Method: POST, RequestUri: 'https://firestore.googleapis.com/google.firestore.v1.Firestore/Listen', Version: 2.0, Content: Grpc.Net.Client.Internal.Http.PushStreamContent`2[Google.Cloud.Firestore.V1.ListenRequest,Google.Cloud.Firestore.V1.ListenResponse], Headers:
{
User-Agent: grpc-dotnet/2.62.0 (.NET 8.0.4; CLR 8.0.4; net8.0; windows; x64)
TE: trailers
grpc-accept-encoding: identity,gzip,deflate
google-cloud-resource-prefix: projects/jonskeet-uberproject/databases/(default)
Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IjVlMDQ4YThjNGZjMDM0ZjIzZjcyZTkzZWUzNWUyZmY3NWY0YzlmZGMiLCJ0eXAiOiJKV1QifQ.eyJzY29wZSI6Imh0dHBzOi8vd3d3Lmdvb2dsZWFwaXMuY29tL2F1dGgvY2xvdWQtcGxhdGZvcm0gaHR0cHM6Ly93d3cuZ29vZ2xlYXBpcy5jb20vYXV0aC9kYXRhc3RvcmUiLCJlbWFpbF92ZXJpZmllZCI6ZmFsc2UsImlzcyI6IjE0OTc3Njg0ODMzNS1nNmR2Z2M2YTRtNWw1amw0aW0zZjlhbWVua2JqZWtpZ0BkZXZlbG9wZXIuZ3NlcnZpY2VhY2NvdW50LmNvbSIsInN1YiI6IjE0OTc3Njg0ODMzNS1nNmR2Z2M2YTRtNWw1amw0aW0zZjlhbWVua2JqZWtpZ0BkZXZlbG9wZXIuZ3NlcnZpY2VhY2NvdW50LmNvbSIsImV4cCI6MTcxNDY0MzY5NiwiaWF0IjoxNzE0NjQwMDk2fQ.77rI6hFL1QLQTR14qmtiBWjodu3RT4iOv8vjY5j3QWfOQOroW9mbGttOGBY_8WhDK9n-jH-fKlsyeqCoe8SaDD_xm0Ahq3r7676MsFci6PsKUD1PO1yGP8bDRhzesxBvJqFFL3cONyVjHcmSjoXHjPtxcCf88oDeyZMOpkIg5mDs0mC88iYxWLh-gxDhvAbK6wLcXPVREWvvkIO6KuqwI37cYCkZWliFHG1FPfovHKGhIxE9rkk3sRYzHSWkq1FstacKe_Q42PWQL6wBFsZ6mwiG25Y0nByMweMOQBVpZnpSMV9ad6yv1vIi7mE_u2ffsHhCf3dYjvFbHLgITczz3Q
Transfer-Encoding: chunked
traceparent: 00-baeba94df59b874132a13ca7151da0b8-be3d633bca67fc7a-00
Content-Type: application/grpc
} ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event System.Net.Http - RequestHeadersStart : 1 ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event System.Net.Http - RequestHeadersStop : ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event System.Net.Http - RequestContentStart : ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : QuicStream#63504289,WriteAsync,[strm][0x2A202CDD110] Stream writing memory of '1290' bytes while not completing writes. ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event System.Net.Http - ResponseHeadersStart : ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : QuicStream#63504289,ReadAsync,[strm][0x2A202CDD110] Stream reading into memory of '64' bytes. ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : QuicStream#63504289,NativeCallback,[strm][0x2A202CDD110] Received event SEND_COMPLETE { Canceled = 0 } ->
08:55:02.693Z dbug: Grpc.Net.Client.Internal.GrpcCall[18]
Sending message.
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : QuicStream#63504289,WriteAsync,[strm][0x2A202CDD110] Stream writing memory of '3' bytes while not completing writes. ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : QuicStream#63504289,NativeCallback,[strm][0x2A202CDD110] Received event SEND_COMPLETE { Canceled = 0 } ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : QuicStream#63504289,WriteAsync,[strm][0x2A202CDD110] Stream writing memory of '163' bytes while not completing writes. ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : QuicStream#63504289,NativeCallback,[strm][0x2A202CDD110] Received event SEND_COMPLETE { Canceled = 0 } ->
08:55:02.693Z dbug: HttpEventSourceListener[0]
<- Event Private.InternalDiagnostics.System.Net.Quic - Info : QuicStream#63504289,WriteAsync,[strm][0x2A202CDD110] Stream writing memory of '0' bytes while not completing writes. ->
So the first part of the "good" log that's missing from the "bad" log is RequestLeftQueue. I assume that means the broken request is still sitting in the queue, but I don't know enough about that queue to know what the next step should be...
I see what it is. The client and server are using HTTP/3. The "very specific conditions" are exactly the conditions under which .NET supports HTTP/3.
I'm sure the problem is still the connection's stream limit, but I don't think there is a way to allow the .NET client to create more connections when the limit is reached for HTTP/3. Couple of issues to improve:
I think an API for this will be added for .NET 9.
You can work around this issue by creating a delegating handler that sets HttpRequestMessage.VersionPolicy to RequestVersionExact instead of "or greater". The gRPC client will then only negotiate an HTTP/2 connection and will ignore the HTTP/3 upgrade.
Thanks for filing the additional issues. I'll try the version policy to validate your diagnosis - it sounds right to me though. (Would love to know why the initial unary RPC makes a difference, but...)
We (Google Cloud client libraries) have a recurring problem: when aspects of the HTTP message handler need to be tweaked, we have no clean way of doing it - basically either we ask the customer to create the handler themselves, or we'd have to reproduce all the logic in Grpc.Net to work out which handler to create. This is another example of where it would be nice to have more flexibility.
I'll ponder that aspect more and create a new issue with a proposal. I'm expecting a bunch of back and forth on it though :)
> Using a different client (with a different channel) for the initial unary RPC: all 300 listeners work. It's as if the initial unary call taints the channel somehow, stopping it from handling all 300 listeners, only handling 100 at a time.
HTTP/3 is negotiated via a response header (Alt-Svc). The initial unary request goes over HTTP/2, and the follow-up Listen calls then go over an upgraded HTTP/3 connection.
I'm guessing that without the initial unary request, all 300 listeners are created on HTTP/2 before any response is received.
Hooray - I can confirm that with the following code, all is well:
```csharp
FirestoreClient CreateClient()
{
    string[] scopes =
    {
        "https://www.googleapis.com/auth/cloud-platform",
        "https://www.googleapis.com/auth/datastore",
    };
    var cred = GoogleCredential.GetApplicationDefault().CreateScoped(scopes);
    if (cred.UnderlyingCredential is ServiceAccountCredential serviceCredential)
    {
        cred = GoogleCredential.FromServiceAccountCredential(serviceCredential.WithUseJwtAccessWithScopes(true));
    }
    var channelOptions = new GrpcChannelOptions
    {
        Credentials = cred.ToChannelCredentials(),
        // Wrap the handler so every request pins its HTTP version policy.
        HttpHandler = new ExactVersionHandler(new SocketsHttpHandler { EnableMultipleHttp2Connections = true })
    };
    var channel = GrpcChannel.ForAddress("https://firestore.googleapis.com", channelOptions);
    return new FirestoreClient(channel);
}

internal class ExactVersionHandler : DelegatingHandler
{
    internal ExactVersionHandler(HttpMessageHandler handler) : base(handler)
    {
    }

    protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Stick to the request's version (HTTP/2 for gRPC) rather than upgrading to HTTP/3.
        request.VersionPolicy = HttpVersionPolicy.RequestVersionExact;
        return base.SendAsync(request, cancellationToken);
    }

    protected override HttpResponseMessage Send(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        request.VersionPolicy = HttpVersionPolicy.RequestVersionExact;
        return base.Send(request, cancellationToken);
    }
}
```
(That's clearly not production code, of course... but it proves your hypothesis.)
Thanks so much! I'll leave this open for you to handle as you wish - now I need to work out what to do in client libraries :)
Overloading `Send` isn't necessary. gRPC only uses `SendAsync`.
I think it might be worth adding `GrpcChannelOptions.HttpVersion` and `GrpcChannelOptions.HttpVersionPolicy` properties.
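Hypothetically (these properties are only a suggestion at this point, so the names are assumptions), usage might look like:

```csharp
using System.Net;

var channel = GrpcChannel.ForAddress("https://firestore.googleapis.com", new GrpcChannelOptions
{
    // Pin the channel to HTTP/2 and never upgrade to HTTP/3.
    HttpVersion = HttpVersion.Version20,
    HttpVersionPolicy = HttpVersionPolicy.RequestVersionExact,
    Credentials = credentials // placeholder
});
```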
> Overloading `Send` isn't necessary. gRPC only uses `SendAsync`.
Right. I only did so here as it was quicker to do that than check whether it would be used ;)
> I think it might be worth adding `GrpcChannelOptions.HttpVersion` and `GrpcChannelOptions.HttpVersionPolicy` properties.
That would be wonderful. I think it might be worth doing a general audit of "what's worth including in channel options?" - high on my priority list would be keep-alive settings, if you'd be open to that.
Any more thoughts on updating GrpcChannelOptions? We've seen a few more issues like this where we've suggested the workaround above. I'll put a bit of a hack in Google.Api.Gax.Grpc as an option if necessary, but GrpcChannelOptions would be much better...
I think it is a good idea. And defaulting to HTTP/2 only seems like a good idea too. Making HTTP/3 an explicit choice would reduce confusion.
Just need to find the time to do it.