pnxtech/hydra

Redis streams?

rantecki opened this issue · 2 comments

TL;DR: Would it be better to use Redis streams for inter-service messaging?

A bit of background first: I've been using Hydra for a little while in production. One common complaint that has fed back to me is the occasional occurrence of timeouts when handling client requests (i.e. a sent message never gets a response within the accepted timeframe). So I've been looking into what I can do about this to improve reliability.

For some use cases, a couple of seconds of outage isn't that bothersome - e.g. for a web API the user could probably just refresh the page and go about their merry way. For other use cases services not responding can be quite jarring to the end-user experience. My use case tends towards the latter.

From what I can see there are 2 paths that could lead to this situation:

  1. Service outages - a service instance goes down momentarily (due to crash or other non-seamless restart)
  2. High load - high load conditions cause the service to not be able to respond to the request in time (due to event loop lag / starvation etc..)

I immediately suspected that high load would be the cause of our problems. However, on extensive testing I couldn't replicate a scenario where timeouts are induced, except under very unrealistic conditions. So I concluded that outages are the more likely culprit.

Hydra uses Redis pub/sub for inter-service messaging, which is essentially fire and forget. This is actually pretty reliable in practise (I've tried to break it, and it takes quite a bit of effort). Where things become murky though is that Hydra can take several seconds before acknowledging that a service is down. Related to this also is how Hydra does load balancing / resolution across multiple instances - it decides on the specific instance to route the message to up-front. So when a service instance crashes, even if there are multiple instances of the service running, messages can be routed to a dead instance for a period of time (a couple of seconds at least), resulting in timeouts.

If you don't have multiple services running (sometimes it just isn't practical cost-wise to do so or necessary from a load perspective), then you're probably out of the game for at least several more seconds (e.g. if running under Amazon ECS or similar). Note: I'm aware of the fallbackToQueue option in Hydra, but my use case doesn't involve the use of makeAPIRequest(), so it's not relevant to me.

Certainly we could argue that our services just shouldn't crash. Unfortunately, crashes are an unavoidable part of life in software development. A good architecture is one that can be tolerant to such things.

Recently I noticed that there was a new feature in Redis called Streams introduced in v5.0. From what I read, it sounds like streams could potentially be used to replace the use of pub/sub and solve some of these problems quite elegantly.

Queueing - Firstly, when a new message is added to a stream, it stays there until explicitly retrieved by the recipient. I guess the difference from pub/sub here is that with pub/sub the message queue is ephemeral and will disappear as soon as the connection drops. It can't be recovered. A stream though is given an explicit name (which could be tied to the name of the service) and is persistent, so could be picked up by the service when it restarts. This would allow for messages to be queued until the service is back up and running.

Load balancing / instance resolution - Streams may also provide a better load balancing mechanism. With Redis streams it appears possible to construct a worker-like queue where messages are distributed across multiple service instances, but each instance only handles a unique subset of the messages (see Consumer Groups). So when a message needs to be sent to a service (but not a particular instance of the service), it could be added to the general queue for that service, which is consumed by all instances.
With this setup we can’t end up in a situation where a message is sent to a dead service instance. Additionally, the recipients themselves make the decision on when they can handle a new message, so the load would arguably be more optimally distributed (compared to Hydra's random/shuffled distribution).

Metrics - No metrics are available with Redis pub/sub, so you can't see what's going on. The stream queue is readily observable though (e.g. via XINFO and XPENDING). The size of a particular queue (i.e. the number of unprocessed items waiting) could be used as an indicator of load, and a metric on which to base service scaling decisions.

One possible caveat is that in the case where a service is down for an extended period of time, the queue could potentially grow to a large size. When the service comes up again it would then be busy processing outdated requests which the sender has probably long abandoned. One possible workaround for this is to add an expiry date to each message to tell the recipient “if you get this message after this time, then just ignore it”. The expiry time could be set as slightly lower than the sender’s own timeout window, after which it gives up on receiving a response. Note that streams have the option of specifying the maximum size of the queue before dumping old messages (see the MAXLEN option of the XADD command). This would be useful to ensure that a queue can’t grow unboundedly.

It also appears possible for the sender to confirm whether the message has been accepted for processing yet, or still waiting in the queue. So it could potentially make more intelligent decisions based on this (i.e. wait longer, or send a more informative error message back to the client). If it chooses to give up, it could even remove the request from the queue to tidy up after itself.

Anyway, before I jump into potentially prototyping the integration of streams into Hydra, I just wanted to check if anyone on the team has heard of streams and considered their use in Hydra before. Perhaps there is a good reason they are not used. Maybe my analysis is faulty. Maybe it's already being worked on. Any comments would be appreciated.

cjus commented

@rantecki First off thank you for the detailed post above. We've been tracking Redis Streams for some time and they were well discussed at this year's RedisConf in San Francisco. The use of streams is of great interest to us as we've also explore the use of Kafka - but would rather use Redis where possible.

The single biggest reason for not adopting streams directly in hydra-core is that doing so would require Redis 5.x. For users who depend on hosted Redis clusters the wait for Redis 5 adoption might take a while.

One possibility is to add support for Redis Streams via the Hydra plugin feature so that they use can be selectively added. Keep in mind that for Hydra 2.0 we're planning on adopting IORedis.

Thanks for the input @cjus . I didn't realise Redis 5 was so recent (and it turns out actually still in beta - which would probably be a big issue for most production deployments).

I may go ahead with my own experimental implementation, either via plugins or not. I'm pondering the benefits of moving all explicit core functionality (e.g. messaging, service lookup) to plugins, so the core of Hydra is reduced to the config and plugin system. That way swapping in and out various components would be more straight forward. I've actually already had to hack in custom JSON serialisation as I'm using EJSON (which preserves dates).

Appreciate the link to the RedisConf vids - I'll definitely be watching a few of those.