swift-server/async-http-client

Crash: precondition failure in RequestBag+StateMachine.swift

rausnitz opened this issue · 15 comments

We're on version 1.9.0 and we saw hundreds of these crashes in our server logs today. The crashes appear to have started more than a week ago, but they escalated sharply today.

We tried rolling back to 1.8.2 but that did not resolve the issue.

0x5561fde6eed2, Backtrace.(printBacktrace in _B82A8C0ED7C904841114FDF244F9E58E)(signal: Swift.Int32) -> () at /app/.build/checkouts/swift-backtrace/Sources/Backtrace/Backtrace.swift:66
0x7f108b62497f
0x5561fdd4ab60, Swift runtime failure: precondition failure at /app/.build/checkouts/async-http-client/Sources/AsyncHTTPClient/RequestBag+StateMachine.swift:0
0x5561fdd4ab60, AsyncHTTPClient.RequestBag.StateMachine.resumeRequestBodyStream() -> AsyncHTTPClient.RequestBag<A>.StateMachine.ResumeProducingAction at /app/.build/checkouts/async-http-client/Sources/AsyncHTTPClient/RequestBag+StateMachine.swift:130
0x5561fdd52166, AsyncHTTPClient.RequestBag.(resumeRequestBodyStream0 in _828EC98A744D86D088B343098CA6B2A6)() -> () at /app/.build/checkouts/async-http-client/Sources/AsyncHTTPClient/RequestBag.swift:102
0x5561fe5668bb, reabstraction thunk helper from @escaping @callee_guaranteed () -> () to @escaping @callee_guaranteed () -> (@out ()) at /app/<compiler-generated>:0
0x5561fe5668bb, partial apply forwarder for reabstraction thunk helper from @escaping @callee_guaranteed () -> () to @escaping @callee_guaranteed () -> (@out ()) at /app/<compiler-generated>:0
0x5561fe560110, reabstraction thunk helper from @escaping @callee_guaranteed () -> (@out ()) to @escaping @callee_guaranteed () -> () at /app/<compiler-generated>:0
0x5561fe560110, closure #4 () -> () in NIOPosix.SelectableEventLoop.run() throws -> () at /app/.build/checkouts/swift-nio/Sources/NIOPosix/SelectableEventLoop.swift:520
0x5561fe560110, reabstraction thunk helper from @callee_guaranteed () -> (@error @owned Swift.Error) to @escaping @callee_guaranteed () -> (@out (), @error @owned Swift.Error) at /app/<compiler-generated>:0
0x5561fe560110, generic specialization <()> of NIOPosix.withAutoReleasePool<A>(() throws -> A) throws -> A at /app/.build/checkouts/swift-nio/Sources/NIOPosix/SelectableEventLoop.swift:28
0x5561fe560110, NIOPosix.SelectableEventLoop.run() throws -> () at /app/.build/checkouts/swift-nio/Sources/NIOPosix/SelectableEventLoop.swift:519
0x5561fe54b6c4, static NIOPosix.MultiThreadedEventLoopGroup.(runTheLoop in _C2B1528F4FBA68A3DBFA89DBAEBE9D4D)(thread: NIOPosix.NIOThread, parentGroup: Swift.Optional<NIOPosix.MultiThreadedEventLoopGroup>, canEventLoopBeShutdownIndividually: Swift.Bool, selectorFactory: () throws -> NIOPosix.Selector<NIOPosix.NIORegistration>, initializer: (NIOPosix.NIOThread) -> (), _: (NIOPosix.SelectableEventLoop) -> ()) -> () at /app/.build/checkouts/swift-nio/Sources/NIOPosix/MultiThreadedEventLoopGroup.swift:89
0x5561fe54b6c4, closure #1 (NIOPosix.NIOThread) -> () in static NIOPosix.MultiThreadedEventLoopGroup.(setupThreadAndEventLoop in _C2B1528F4FBA68A3DBFA89DBAEBE9D4D)(name: Swift.String, parentGroup: NIOPosix.MultiThreadedEventLoopGroup, selectorFactory: () throws -> NIOPosix.Selector<NIOPosix.NIORegistration>, initializer: (NIOPosix.NIOThread) -> ()) -> NIOPosix.SelectableEventLoop at /app/.build/checkouts/swift-nio/Sources/NIOPosix/MultiThreadedEventLoopGroup.swift:110
0x5561fe54e20e, partial apply forwarder for closure #1 (NIOPosix.NIOThread) -> () in static NIOPosix.MultiThreadedEventLoopGroup.(setupThreadAndEventLoop in _C2B1528F4FBA68A3DBFA89DBAEBE9D4D)(name: Swift.String, parentGroup: NIOPosix.MultiThreadedEventLoopGroup, selectorFactory: () throws -> NIOPosix.Selector<NIOPosix.NIORegistration>, initializer: (NIOPosix.NIOThread) -> ()) -> NIOPosix.SelectableEventLoop at /app/<compiler-generated>:0
0x5561fe54e509, reabstraction thunk helper from @escaping @callee_guaranteed (@guaranteed NIOPosix.NIOThread) -> () to @escaping @callee_guaranteed (@in_guaranteed NIOPosix.NIOThread) -> (@out ()) at /app/<compiler-generated>:0
0x5561fe54e22d, partial apply forwarder for reabstraction thunk helper from @escaping @callee_guaranteed (@guaranteed NIOPosix.NIOThread) -> () to @escaping @callee_guaranteed (@in_guaranteed NIOPosix.NIOThread) -> (@out ()) at /app/<compiler-generated>:0
0x5561fe57abb5, closure #1 (Swift.Optional<Swift.UnsafeMutableRawPointer>) -> Swift.Optional<Swift.UnsafeMutableRawPointer> in static NIOPosix.ThreadOpsPosix.run(handle: inout Swift.Optional<Swift.UInt>, args: NIOPosix.Box<(body: (NIOPosix.NIOThread) -> (), name: Swift.Optional<Swift.String>)>, detachThread: Swift.Bool) -> () at /app/.build/checkouts/swift-nio/Sources/NIOPosix/ThreadPosix.swift:105
0x7f108b6196da
0x7f1089b2661e
0xffffffffffffffff

Any ideas on how we can avoid this or debug it further?

Hmm, this appears to be happening due to overzealous policing of the .finished state in the request bag state machine. Specifically, if we resume the request body stream but have already failed the request, we'll crash. The crash is reasonable in some cases, but it's probably too strict here.

As a temporary workaround you can try disabling HTTP/2 support to see whether this only occurs on the HTTP/2 path; if it also occurs on the HTTP/1 path, there may be other issues. Do you have logging enabled? It would be useful to see the logs to identify the specific trigger, as I think this can only happen if we have already hit a different error in the request pipeline.
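For reference, a minimal sketch of that workaround, assuming a release new enough to have `HTTPClient.Configuration.httpVersion` (the default is `.automatic`, which negotiates HTTP/2 via ALPN):

```swift
import AsyncHTTPClient

// Hedged sketch: pin the client to HTTP/1.1 so the HTTP/2 code path is never
// exercised. This only narrows down where the crash occurs; it is not a fix.
var configuration = HTTPClient.Configuration()
configuration.httpVersion = .http1Only

let client = HTTPClient(
    eventLoopGroupProvider: .createNew,
    configuration: configuration
)
```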

This appears to be a bit of a timing-window condition. Out of curiosity, do you ever call .cancel on the HTTPClient.Task object?

@Lukasa Zach can correct me but I don't think we ever call .cancel

Confirmed, cancel() does not appear anywhere in our codebase.

Will reply to your earlier comment ASAP.

Calling .cancel is the only way through the code I can confidently say can trigger this path, but that's sufficiently good to allow me to write a test that hits the crash, and I'll use it as a justification for defensive programming as well.
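For anyone grepping their codebase, a minimal sketch of the kind of call site in question (the URL, body contents, and use of `ResponseAccumulator` are placeholders, not taken from this project):

```swift
import AsyncHTTPClient
import NIOCore
import NIOHTTP1

// Hedged sketch: cancelling a task with a streamed body while the connection
// pool is resuming that body is the one path known to reach the crashing
// precondition.
func uploadAndCancel() throws {
    let client = HTTPClient(eventLoopGroupProvider: .createNew)
    defer { try? client.syncShutdown() }

    var request = try HTTPClient.Request(url: "https://example.com/upload", method: .POST)
    request.body = .stream { writer in
        // Streamed upload body; the resume/cancel race only matters for
        // requests that actually have a body to stream.
        writer.write(.byteBuffer(ByteBuffer(string: "payload")))
    }

    let task = client.execute(request: request, delegate: ResponseAccumulator(request: request))

    // Cancelling concurrently with the "resume body" event-loop hop is the
    // timing window discussed in this issue.
    task.cancel()
    _ = try? task.futureResult.wait()
}
```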

> Hmm, this appears to be happening due to overzealous policing of the .finished state in the request bag state machine. Specifically, if we resume the request body stream but have already failed the request, we'll crash. The crash is reasonable in some cases, but it's probably too strict here.
>
> As a temporary workaround you can try disabling HTTP/2 support to see whether this only occurs on the HTTP/2 path; if it also occurs on the HTTP/1 path, there may be other issues. Do you have logging enabled? It would be useful to see the logs to identify the specific trigger, as I think this can only happen if we have already hit a different error in the request pipeline.

I set the log level to DEBUG, but I'm not sure which logs would be useful to share. Our backend serves many simultaneous requests, and this crash happens many times per minute at its peak. I haven't been able to isolate the log lines related to the crash, and I'd need to scrub sensitive information before sharing anything here, so I'll need to narrow the scope to make that practical.

I think we'd need to go to TRACE in order to see the appropriate logs. Unfortunately filtering is going to be hard: because the crash happens after an event loop hop it's not easy to work out exactly which request has hit the crash, so we can't filter by request ID.
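For reference, a minimal sketch of cranking the client up to trace-level logging (the logger label and URL are placeholders):

```swift
import AsyncHTTPClient
import Logging
import NIOCore

// Hedged sketch: pass a .trace-level Logger both as the client's background
// activity logger and per request, so connection pool and request state
// transitions show up in the logs.
func traceOneRequest() throws {
    var logger = Logger(label: "ahc-trace")
    logger.logLevel = .trace

    let client = HTTPClient(
        eventLoopGroupProvider: .createNew,
        backgroundActivityLogger: logger
    )
    defer { try? client.syncShutdown() }

    // The per-request logger overload attaches the request's trace output.
    let response = try client.get(url: "https://example.com/health", logger: logger).wait()
    print(response.status)
}
```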

Out of curiosity, has it been possible to try disabling HTTP/2?

You can also try using the branch from #577 to see how that affects your issue.

Sounds good, will change to TRACE. We have an isolated backend environment that I can point my frontend environment at, so I'll try to reproduce the crash there shortly. If that works, the logs will be more useful because my requests will be the only ones in them.

I haven't tried disabling HTTP/2 yet. The DEBUG logs show requests to certain external services using HTTP/2 while others use HTTP/1.1, and I haven't been able to test the effect of disabling HTTP/2 outside of prod yet.

DEBUG  [ DEBUG ] connection closed [ahc-connection-id: 18, ahc-http-version: http/2, ahc-pool-key: https://{{removed}}:443 TLS-hash: -5432264542835967314] (AsyncHTTPClient/ConnectionPool/HTTPConnectionPool.swift:517)

DEBUG  [ DEBUG ] connection closed [ahc-connection-id: 27, ahc-http-version: http/1.1, ahc-pool-key: https://{{removed}}:443 TLS-hash: -5432264542835967314] (AsyncHTTPClient/ConnectionPool/HTTPConnectionPool.swift:474)

@Lukasa I think we're in good shape now. I deployed to production using #577 about 30 minutes ago and have not seen a crash since. No logs since that deploy contain `Received signal 4. Backtrace`, compared to hundreds of logs mentioning it in the prior hour. I'll continue to monitor.

Do you have any guesses about where this behavior might have come from? As far as we can tell, it was not correlated with any release of our application, and we haven't made changes to our use of the client or general configuration recently. Could unexpected behavior by an external service (that our application's HTTP client is sending requests to) cause this to occur?

Thanks so much for your help with this!

I suspect an external service, yes.

The specific theory I have is that this is a timing window. RequestBag is local to the Task event loop, so all information it receives about the underlying Channel state requires an event loop hop. That means that it can race with information from the Task event loop, or from any other thread in the program. In this instance, we might ask to resume the body after we've decided we don't need to upload it for some other reason.
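A purely illustrative sketch of that window, using hypothetical types rather than the real RequestBag internals:

```swift
import NIOCore

// Illustrative only: state lives on the task's event loop, but the channel
// reports "please resume the body" via a hop onto that loop, so by the time
// the hop runs the request may already have been failed locally.
final class HypotheticalRequestState {
    private enum State { case running, finished }
    private var state: State = .running
    private let taskEventLoop: EventLoop

    init(on eventLoop: EventLoop) {
        self.taskEventLoop = eventLoop
    }

    // Runs on the task event loop, e.g. after cancel() or an earlier error.
    func fail() {
        self.taskEventLoop.preconditionInEventLoop()
        self.state = .finished
    }

    // Called from the channel's event loop; the execute() hop is the window.
    func channelWantsBody() {
        self.taskEventLoop.execute {
            switch self.state {
            case .running:
                break // resume the body stream
            case .finished:
                // A hard precondition here is what crashed; tolerating the
                // late resume by doing nothing closes the window.
                break
            }
        }
    }
}
```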

The thing that gives me pause is that I don't think there are many things that can fail a Task in this way. Aside from cancel, the only other theory I have is that an error in a streaming request body might trigger this, but even that seems unlikely. So without logs it's hard to be totally sure what the cause is.

However, the precondition itself was a bit aggressive, so the fact that this can fail in even one way makes it fairly likely that it can fail in others.

Oh, and to clarify: if the other service's timing has changed, that can affect the size of this window.

Thanks for explaining that.

Two hours since the deploy and we continue to be crash-free.

Sorry to say I wasn't able to reproduce the crash in a separate environment, so this will probably remain a mystery, but if we learn anything more I'll post it here.

I wanted to provide a bit more info in case it's of interest to you. During the period when these crashes were occurring, we had a prolonged increase in CPU utilization, occasionally spiking to 100%. (That CPU increase seems to be directly related to onboarding a new customer, which required our servers to process a higher volume of data than they typically do.)

Maybe there's a relationship between maxing out our CPU and the unexpected behavior reported in this issue. Not sure what the mechanism would be, just a hunch.

The high CPU usage likely correlates with increased queue depth and contention on the event loop threads, which widens any timing window that involves scheduling on those threads.