smol-rs/async-channel

Document channel (non)-fairness

abonander opened this issue · 9 comments

In SQLx we originally tried using async-std's MPMC channels to implement our connection pool but found that particular implementation preferred high throughput over fairness, which is undesirable for a connection pool meant for use in a webserver; at high load some clients would time out waiting for a connection while other clients were served successfully. We pretty quickly switched to our own ad-hoc implementation, but that still suffers from unfairness: launchbadge/sqlx#453

Though I don't think that's impossible to remedy in our current implementation, I think using a ready-made, fair MPMC channel would simplify things a lot.

I don't see any documentation about what guarantees this crate can make regarding fairness, is that something that's been considered/characterized yet?

I could implement something similar to what async-mutex does: we make things unfair (whoever gets more lucky goes first), but if a recv()/send() operation is starved for more than 0.5 sec ms, it raises a flag that says "ok that's enough, everybody get in line", at which point existing and new recv()/send() operations must be fair and get in line.

parking_lot does something similar and calls this "eventual fairness". It's not clear to me how (or if at all) we could make try_recv() and try_send() operations respect fairness, but we could certainly implement an eventual fairness strategy for recv() and send().

What do you think? I could implement this in a branch for you to try out...

Eventual fairness seems like it would be good enough if it helps to keep average latency down, though I think we would like to be able to tune the threshold for starvation and switching to the fairness regime since 500ms seems like a long time to starve a client. I realize you don't want that too low either or else you might as well not have a threshold at all and just always be fair, but I think we'd prefer response times at increasingly high load to degrade gracefully across the board rather than randomly spike. I'm not completely certain about that though.

I think for try_recv() to respect fairness it would have to always return Err(Empty) if the channel is in fair mode and there's tasks waiting in the queue; that's essentially the fix I'm implementing for launchbadge/sqlx#453.

On an unrelated note, I think if we switch to using an async MPMC channel I would also like to utilize a semaphore for controlling when new connections can be made (since that's another thing we've basically reimplemented in our bespoke impl for Pool), but I'm having trouble finding an executor-agnostic async semaphore. Any recommendations?

Ooops, I meant 0.5 ms, not 500 ms! :) Would 0.5 ms be acceptable?

Btw, this whole thread is relevant and I found it insightful: golang/go#11506 (comment)

Oh yeah, 0.5ms is perfectly acceptable, although if that clock is always ticking even when messages aren't going through the channel then in a use-case like a Pool where we expect all connections to be checked out most the time then we might always be considered in starvation mode and forced to be fair. However, I don't really know how a receiver could know the difference between starvation and just no messages coming through the channel.

The discussion of the tradeoffs in that linked thread is definitely interesting. I think the concern with fairness there is that even if a message is made available to a given Goroutine, there's no guarantee that the runtime will resume executing that Goroutine immediately and so the message will just sit in the channel (or messages get served out of order, which sounds like an orthogonal issue to me). So by being unfair, the tradeoff is potential receiver starvation but high throughput, which is exactly what async-std chose to do with their channels.

High throughput is definitely desirable for a pool since we want to most efficiently utilize limited resources, i.e. database connections, by having them sit idle in the pool as little as possible. But from a perspective of serving clients, we want to avoid timeouts where possible, so receiver starvation is pretty undesirable. We also don't necessarily care if connections are used in FIFO order; heck, we may prefer LIFO (or at least to have the option) since that would mean that the caches for that connection are warm.

Maybe an MPMC channel just fundamentally has different design concerns and shouldn't really be used for this purpose, even if it would superficially simplify things a lot. I also haven't really tested my fix for our current problem yet so maybe there isn't really a need to gut the implementation at this time.

davxy commented

DQ: are bounded/unbounded channels provided by this library FIFO when there is a single consumer?
I.e. recv respects the send order over the same channel

Yes, this is correct

davxy commented

Nice, is this documented somewhere? I'm not sure it is obvious