tokio-rs/tokio-core

Support Readiness Types with Signal Counters

cramertj opened this issue · 18 comments

Currently, IoToken and the like rely upon being able to convert mio::Ready to and from usize so that readiness can be stored in an AtomicUsize. Similarly, mio internally represents Ready as a usize, so any platform-specific readiness information must be convertible to usize and back. This is fine for epoll-based implementations, which only provide readiness notifications in the form of individual signal bits, but Windows and Fuchsia provide extra information such as signal counters (see lpNumberOfBytes and zx_packet_signal.count). It seems like it would be possible to support sending this extra information by using platform-specific representations for Ready and AtomicReady. Are there any reasons not to do this?
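To make the limitation concrete, here is a minimal sketch of the bitmask-in-AtomicUsize pattern being described; the function names are illustrative, not actual mio or tokio-core API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// The epoll-style model: the reactor ORs the new readiness bits (e.g. a
// hypothetical READABLE = 0b01) into a shared word...
fn notify(readiness: &AtomicUsize, bits: usize) {
    readiness.fetch_or(bits, Ordering::SeqCst);
}

// ...and the I/O object later takes whatever accumulated. A bitmask can say
// *that* something is ready, but not *how many* signals arrived, which is the
// extra information a Windows/Fuchsia-style counter carries.
fn take(readiness: &AtomicUsize) -> usize {
    readiness.swap(0, Ordering::SeqCst)
}
```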

On a related note, has there been any thought about the best way to support signal counters via the tokio/futures ecosystem? Fuchsia's reactors are mostly designed around a single-reactor-per-thread model, where all incoming signals are processed before another reactor crank. The reason for this is that it allows the kernel to control latency by setting signal counters, which tell your application exactly which signals to process, and how many of them, in a single crank.

This seems sort of fundamentally incompatible with futures design, although I'm sure we can find good solutions to bridge the two approaches. The problem is that since futures operate on a pull-based model rather than a push-based model, they won't always process whatever signal just arrived.

My plan at the moment is to use AtomicU64 readiness where the first 32 bits are the signal and the last 32 are a count, and each update will use a load and a compare_and_swap to bitwise-or the signal bits and saturating add the count bits (using u32::MAX as a sentinel value for "process forever"). This should work fine for single-threaded scenarios, but will become less predictable when the tokio reactor cranks notably faster than readiness signals are processed by the future.
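A minimal sketch of that update loop, assuming the layout described above (high 32 bits for the signal mask, low 32 bits for the count) and using compare_exchange in place of compare_and_swap:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of the packed readiness word: high 32 bits = signal bitmask,
// low 32 bits = saturating count, with u32::MAX as the "process forever"
// sentinel.
fn add_readiness(word: &AtomicU64, signals: u32, count: u32) {
    let mut prev = word.load(Ordering::SeqCst);
    loop {
        let new_signals = ((prev >> 32) as u32) | signals;
        // Saturating add keeps u32::MAX stable once it has been set.
        let new_count = (prev as u32).saturating_add(count);
        let next = ((new_signals as u64) << 32) | new_count as u64;
        match word.compare_exchange(prev, next, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(_) => return,
            Err(actual) => prev = actual,
        }
    }
}
```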

cc @carllerche @alexcrichton @raggi

In the general case things get even more interesting. A Fuchsia port packet actually describes the completion of an asynchronous operation. For a wait, the completion packet includes a bitfield of signals and a count. However, we may define other asynchronous operations for which the completion packet would include different information, such as number of bytes transferred.

So ideally we would have a way to associate an operation-dependent payload with the completion of asynchronous operations (including but not limited to waiting for signals).

@j9brown Yeah, so in a future Fuchsia with support for more different packet types, I'd figure Ready notifications would basically just be a wrapper for a Mutex<SmallVec<PacketContents>>.
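As a rough sketch of what such a wrapper could look like (the PacketContents variants and the Ready name are illustrative only, and Vec stands in for SmallVec to keep the sketch dependency-free):

```rust
use std::sync::Mutex;

// Hypothetical operation-dependent payloads.
enum PacketContents {
    Signals { observed: u32, count: u64 },
    BytesTransferred(usize),
}

// The "Ready as a wrapper" idea: readiness becomes the queue of packets
// delivered since the task last ran.
struct Ready {
    packets: Mutex<Vec<PacketContents>>,
}

impl Ready {
    // Reactor side: deposit a completion packet.
    fn push(&self, packet: PacketContents) {
        self.packets.lock().unwrap().push(packet);
    }

    // Future side: drain everything that arrived since the last poll.
    fn take(&self) -> Vec<PacketContents> {
        std::mem::replace(&mut *self.packets.lock().unwrap(), Vec::new())
    }
}
```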

@j9brown I've been thinking more about it, and in a world with custom port packets like you describe, I think we'd be better off using our own event loop rather than trying to shoehorn our way into mio and tokio-core, which are both designed around epoll-style async signals rather than actual async messages, and around a PollEvented which is Send and Sync. We can still reuse futures-rs in our custom loop, and we can continue to support mio and tokio-core's TCP and UDP primitives in order to have interop with other Rust libraries (e.g. hyper). It's a bit unfortunate, but I think the models are different enough that it's worth diverging.

Using a custom message loop may be more efficient overall. For example, we will be able to take advantage of the fact that Fuchsia port packets include a 64-bit key which can be used to accelerate lookup of the appropriate callback to dispatch. We'll also be able to avoid the burden of type erasure otherwise incurred by shoehorning through more abstract interfaces.

@j9brown mio and tokio-core use those keys currently in order to dispatch to the appropriate task (though they use a usize key rather than a u64 key, and I do the appropriate casting in the Fuchsia mio backend).
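For illustration only, the kind of key-based lookup being discussed might look something like this; Dispatcher and Handler are placeholder names, not mio or tokio-core types:

```rust
use std::collections::HashMap;

// Placeholder standing in for "the thing to wake or call" on dispatch.
struct Handler;

// Illustrative dispatch table keyed by the 64-bit packet key. mio's Token
// wraps a usize, so the Fuchsia backend casts between the two (lossless on
// 64-bit targets).
struct Dispatcher {
    handlers: HashMap<u64, Handler>,
}

impl Dispatcher {
    fn dispatch(&self, token: usize) -> Option<&Handler> {
        self.handlers.get(&(token as u64))
    }
}
```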

RE type erasure: I'm not sure quite what you're referring to here. There are a couple of instances of Box<Future<...>> trait objects, but they're used very sparingly for top-level spawned futures (of which a FIDL server has only one, and keeps the rest in a FuturesUnordered). Overall, I don't think that tokio is forcing us into dynamic dispatch any more than a custom event loop would.

(FWIW, after this CL, there will only be two allocation spots on the FIDL server's request path: the insertion of the response future into FuturesUnordered, and the allocation of the FIDL message itself.)

This is indeed quite an interesting question! I agree, though, that I don't think this maps too well onto the current futures/tokio-core/PollEvented model. Most of futures/tokio is built around the ability to have spurious notifications, and guarantees are few and far between. In that sense, when you wake up a future you actually have no idea whether it'll do work based on the wakeup; you just know it has been told it should poll again at some point.

In that sense, trying to actually transmit data from the kernel gets pretty difficult, as you need a guarantee that it's delivered to a particular location and that the location is inspected at an appropriate time. It's possible that this could indeed be shoehorned in, but the idea of a custom event loop here does seem more plausible.

In general though I find it useful to work backwards on problems like these. For example, what does a Fuchsia server look like without futures actually processing all these notifications? I'd imagine that you get a number of events from the kernel and then block on dispatching all of these to a "task". With a futures-like model you'd basically be creating an executor for objects which take these notifications as input.

One mechanism may be to define your own custom "future trait" which takes an event as input, perhaps? You could then internally bridge a future to your custom trait, and you'd have your own custom reactor working with futures of this spawned type. I'm not sure how far that'd compose in the stack, though, or how well it'd fit together; for example, it may not compose well if you're waiting on multiple events...
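Purely as an illustration of that idea, such a trait might look something like the following (none of these names exist in futures-rs):

```rust
// Placeholder for whatever the kernel delivers on completion of an operation.
struct Packet;

// An illustrative "event-driven future" trait: instead of a bare poll(), the
// executor hands the future the packet that woke it. (Option here stands in
// for futures 0.1's Async: None means "not ready yet".)
trait EventFuture {
    type Item;
    type Error;

    fn poll_event(&mut self, event: Packet) -> Result<Option<Self::Item>, Self::Error>;
}
```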

@alexcrichton My thought was to use the current futures trait but swap out PollEvented's IoToken for a list of packets. Similar to how Handle gets passed around now, we could pass around a reference to Fuchsia's reactor, which individual PollEventeds could use to register their async event under their token, creating a mapping from that token to (a) the current task, so the future can be awoken, and (b) a weak reference to their list of events (rc::Weak<RefCell<Vec<Packet>>>).

If I understand correctly, this is pretty similar to how add_source(...) and handle.send(...) work currently, except that we (a) perform the call to register ourselves (since there are a number of possible register equivalents) and (b) share rc::Weak<RefCell<Vec<Packet>>> with the reactor instead of Arc<AtomicUsize>.
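A rough sketch of that registration path, with ReactorHandle, Packet, Task, and register_packets all being placeholder names:

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::{Rc, Weak};

// Placeholder for a Fuchsia port packet.
struct Packet;

// Placeholder for "the current task" the reactor should wake; in futures 0.1
// terms this would be a futures::task::Task obtained via task::current().
struct Task;

// What the reactor keeps per registration: who to wake, and where to deposit
// the packets it receives for that key.
struct Registration {
    task: Task,
    packets: Weak<RefCell<Vec<Packet>>>,
}

// The single-threaded reactor handle passed around like tokio-core's Handle.
struct ReactorHandle {
    registrations: Rc<RefCell<HashMap<u64, Registration>>>,
}

impl ReactorHandle {
    // Called by a PollEvented-like object after it has issued its own
    // platform-specific register call (zx_object_wait_async or similar).
    fn register_packets(&self, key: u64, task: Task, packets: Weak<RefCell<Vec<Packet>>>) {
        self.registrations
            .borrow_mut()
            .insert(key, Registration { task, packets });
    }
}
```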

@j9brown @alexcrichton Does this seem reasonable to you? What issues do you think we'll encounter?

This model is also compatible with tokio-core if we (a) make a non-Send/Sync version of PollEvented and (b) make some changes to how Ready works (break the dependency on usize, split it apart into ReadyRequest and ReadyResponse or similar, and allow platform-specific implementations of these types). These changes would be fairly significant, but they offer some benefits for other platforms such as Windows and Solaris, where the ReadyRequest type is different from the ReadyResponse type. Similarly, a non-Send/Sync PollEvented would offer performance benefits for applications using a reactor-per-thread model.
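One possible (entirely hypothetical) shape for that split, just to illustrate the direction:

```rust
// Hypothetical sketch of splitting Ready into a platform-defined "request"
// (what you want to be woken for) and "response" (what actually happened,
// counts and all); none of these names exist in mio or tokio-core today.
trait PlatformReadiness {
    type ReadyRequest;
    type ReadyResponse;

    // Express interest in a set of events.
    fn register_interest(&self, interest: Self::ReadyRequest);

    // Take whatever readiness has accumulated, if any.
    fn take_readiness(&self) -> Option<Self::ReadyResponse>;
}
```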

That sounds like it'd definitely work! My only concern would be that at extreme scales you may run into two issues:

  • The data structures and queues themselves may take up a good bit of memory, and as a result could limit the overall capacity of a server. This is definitely an extreme case, of course.

  • There may need to be a mechanism of back pressure to prevent these queues from growing too long; otherwise a slow consumer may accumulate a massive queue and overload the system.

Yeah, I figured this came with a "light assumption" that:

  • once a future registers itself to receive events, it gets polled on every crank in which events arrive for it
  • every time it gets polled, it handles the full list of events, and deregisters itself from receiving more events if it's getting too many to handle (calling task::notify() so that it will be polled again).

Makes sense! In fact, if these handles are not sendable, then that is perhaps a great assurance that you are indeed consuming the events at an appropriate rate! You could even include debug asserts that all events are consumed each turn.
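For instance, a debug assertion along these lines could run at the end of each turn (a sketch reusing the placeholder types from above):

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Placeholder for a delivered port packet.
struct Packet;

// Illustrative check at the end of a turn: the future is expected to have
// drained every packet delivered to it since it was last polled.
fn assert_drained(packets: &Rc<RefCell<Vec<Packet>>>) {
    debug_assert!(
        packets.borrow().is_empty(),
        "events were delivered but not consumed this turn"
    );
}
```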

To be honest, I don't know enough about Rust or the existing event loop implementations to comment intelligently on the proposal. I will note a few goals, however:

  1. IPC is the heart of Fuchsia so we want sending, receiving, and dispatching messages and their replies to be extremely efficient.
  2. We'd like to be able to send and receive IPC messages without dynamic heap allocation. FIDL v2 will make this easier by allowing the message layout and size bounds to be determined statically.
  3. Fuchsia services will commonly be awaiting completion of many operations concurrently and this set will evolve dynamically over time. It's important that registration, unregistration, and dispatch occur in constant time.
  4. We'd like many Fuchsia system services to support multi-threaded dispatch of requests from independent clients. (More generally, this requires an awareness of serialization domains.)
  5. "polling" might not be the verb you want to have in the heart of an event loop API since it suggests a pattern of checking a state and occasionally doing nothing. This is a common feature of level-triggered mechanisms like select() and epoll() whereas Fuchsia zx_object_wait_async() more closely resembles edge-triggered behavior (more precisely, it enqueues a packet at completion). Perhaps consider abstractions and terminology related to work queues, tasks, or actors instead.

For an example of why these goals are important, consider what it would be like to implement a file or network server in Rust which may at times be called upon to handle many thousands of transactions per second.

Clearly it's ok for some heap allocation and other bookkeeping related operations to take place, such as when a file or socket is opened, but we'd like to avoid introducing unnecessary overhead as a matter of course in the underlying signaling, serialization, and dispatch primitives.

@j9brown Thanks for sharing the goals! That's really helpful. I'm fairly confident that the approach I've outlined will be able to meet those expectations, but as always we'll have to do some experimentation to figure out what works best.

"polling" might not be the verb you want to have in the heart of an event loop API since it suggests a pattern of checking a state and occasionally doing nothing.

To clarify, when I said "poll", I meant literally calling the poll function on the Future trait. If that function is being called, it means that the future's task has been notified that there are events it should pick up and react to.

I'm going to go ahead and close this issue, since it seems like the general agreement is that the solution is to build another system separate from tokio-core. I'll plan to follow up on the rust-fuchsia Google group and on #fuchsia-rust on freenode.

@j9brown If you're looking for an equivalent, Future::poll is similar to async_task.handler in libasync.

The primary downside to a custom event loop is that you lose the ability to use signal counters on the same thread as any existing Rust libraries built to support Tokio.

I'm open to making changes to Mio to better support this case, but starting with a custom event loop would be good, if only to get something working and illustrate the case.

Part of my concern is that the wins won't be immediately obvious. Most of the big differences will come when Fuchsia ports start delivering packets that contain more than just signal data (namely, the operation-dependent payloads @j9brown mentioned above).

Is it possible to abstract the protocol used to await signals so as to allow for different event loop implementations to provide compatible functions needed by existing rust libraries?

@j9brown Yes, that's what I'm working on (the futures-rs libraries). However, some Rust libraries depend on tokio-core to provide async networking primitives, so they won't be immediately compatible with our event loop. If we want to use tokio-core in the future, tokio-core and mio will have to allow platform-specific extensions to the readiness functionality they provide.

I suppose we could make libraries generic over the type of event loop they use, and each event loop could have an associated type for TcpHandler, etc. However, that would mean that all of the networking libraries in the ecosystem would have to either use trait objects (type erasure) or be generic over the event loop they use.
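As a sketch of what that genericity might look like (none of this is existing tokio or futures API; the names are hypothetical):

```rust
// A hypothetical abstraction over event loops: each loop supplies its own
// networking primitive types.
trait EventLoop {
    // In practice these would be async types with their own connect/accept
    // futures; bare associated types are enough to show the shape of the bound.
    type TcpStream;
    type TcpListener;
    type UdpSocket;
}

// A networking library written against the abstraction would then either be
// generic over the loop...
fn serve<E: EventLoop>(listener: E::TcpListener) {
    let _ = listener; // ...or erase E's types behind trait objects instead.
}
```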