Kraigie/nostrum

Feature planning: Distributed ratelimiting

jchristgit opened this issue · 0 comments

We currently have a single ratelimiter state machine per bot, which, according
to user reports, is the best-in-class rate limiting solution out there.
However, it cannot be distributed across multiple nodes, so we can still do
even better.

The issue here is that we need to track state, and on busy bots there is going
to be a lot of it. The state machine ratelimiter is implemented in a way that
essentially does not allow any external state, as it uses :gen_statem timers
and internal queues to track when it needs to send off what, all whilst being
as conservative on memory as possible.

We should determine how ratelimiting works on large bots: whether it's scoped
by the bot token alone, or whether there is some further keying (e.g. some form
of REST API sharding). I believe it is scoped on the bot token.

At first glance, I believe the best approach would be to:

  • Deploy the state machine on all nodes that the developer wants us to run on.
  • Determine the target state machine in the cluster based on consistent hashing
    of the rate limiting key.
  • Find a way to synchronize global rate limits between these instances, if
    possible.
  • When nodes start or stop and rate limiter state machines are redistributed,
    we need to either hand off rate limiter state or drain the existing rate
    limiter in some smart way. How exactly this should be accomplished is
    something we still need to flesh out.
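
To make the consistent hashing step concrete, here is a minimal sketch of a hash ring that maps a rate limiting key to a node. It is written in Python purely for illustration and is not nostrum code. The `HashRing` class, the node names, the route-style keys and the choice of 64 virtual nodes per node are all assumptions made for the example. The point it tries to show is that when a node joins, only the keys that land on the new node change owner, which bounds how much rate limiter state would need to be handed off or drained.

```python
import bisect
import hashlib


class HashRing:
    """Hypothetical consistent hash ring mapping ratelimit keys to nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes  # virtual points per node (assumed value)
        self._ring = []       # sorted list of (point, node) tuples
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # Stable across processes and machines, unlike Python's built-in hash().
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node):
        # Place several virtual points per node to even out the distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def node_for(self, key):
        if not self._ring:
            raise LookupError("no nodes in the ring")
        # First point at or after the key's hash, wrapping around the ring.
        index = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[index % len(self._ring)][1]


# Hypothetical node names and ratelimit keys, for illustration only.
ring = HashRing(["bot@node1", "bot@node2", "bot@node3"])
keys = [f"/channels/{i}/messages" for i in range(1000)]
before = {key: ring.node_for(key) for key in keys}

# A fourth node joins the cluster: find the keys whose owner changed.
ring.add_node("bot@node4")
moved = [key for key in keys if ring.node_for(key) != before[key]]
```

Every key in `moved` is now owned by the new node, and all other keys stay where they were. Removing the node again restores the previous assignment. This does not address global rate limits, which would still need separate synchronization between instances.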