There were some nasty bugs in it that broke things such as shard revival. It wasn't worth the effort to hunt them down, so it's officially abandoned.
Distributed, cacheless* Discord bot sharding.
Discord requires you to shard your bot as the number of servers it's on grows. Having all of your shards in a single process is a huge risk to uptime, though, since that process dying effectively kills your entire application. However, distributing these shards has always been a huge pain due to state coordination. That is, until now.
When the nodes come up, a master node is elected. Each node also creates a "cluster worker" that handles spawning shards. Once the master is up, it starts assigning shards to nodes. Shards are simply assigned to whichever node has the fewest shard processes on it at that point in time.
When a shard goes down, the master node is signaled and reassigns the shard to a new node as necessary. This also balances your node cluster over time. Later, the plan is for the master to proactively balance shards across the cluster.
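The assignment and reassignment rules above can be sketched in a few lines. This is a minimal illustration in Python, not G's actual implementation: the Master class, its method names, and the tie-breaking rule (first node in insertion order wins) are all assumptions made for the example.

```python
class Master:
    """Hypothetical stand-in for the master node's shard bookkeeping."""

    def __init__(self, node_names):
        # node name -> set of shard ids currently assigned to that node
        self.nodes = {name: set() for name in node_names}

    def least_loaded(self):
        # min() returns the first minimum, so ties go to the earliest node
        return min(self.nodes, key=lambda name: len(self.nodes[name]))

    def assign(self, shard_id):
        """Assign a shard to whichever node has the fewest shards right now."""
        node = self.least_loaded()
        self.nodes[node].add(shard_id)
        return node

    def shard_down(self, shard_id):
        """Forget the dead shard, then reassign it like any new shard."""
        for shards in self.nodes.values():
            shards.discard(shard_id)
        return self.assign(shard_id)


master = Master(["node-a", "node-b", "node-c", "node-d"])
for shard_id in range(50):
    master.assign(shard_id)
counts = sorted(len(shards) for shards in master.nodes.values())
```

Because a revived shard always lands on the currently least-loaded node, repeated failures nudge the cluster toward an even spread, which is the "balance over time" effect described above.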
While this is a fully functional Discord bot gateway library in its current state, a few vital pieces of functionality are missing. Specifically, I still need to implement a proper event queueing system so backend workers can receive events, as well as a way for users to send gateway messages.
Session-stealing is a useful trick to help minimize bot downtime. Rather than have to go through the full connect-and-identify flow, you can "steal" the session from a previous shard with the same id / shard total, and resume the session on a new shard (potentially even on a different node!) so that you can continue your shard's session without any downtime.
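As a rough sketch of the idea, a new shard can decide between resuming and identifying based on whether a previous session was saved for the same shard id and total. The opcodes (2 for IDENTIFY, 6 for RESUME) and the RESUME fields come from the Discord gateway protocol; the function name, the stored_session dict shape, and where the session is saved are assumptions for illustration only, written in Python rather than G's own language.

```python
IDENTIFY_OP = 2  # Discord gateway opcode: start a brand-new session
RESUME_OP = 6    # Discord gateway opcode: resume an existing session


def first_payload(token, shard_id, shard_total, stored_session=None):
    """Build the first payload a freshly started shard would send.

    stored_session is a hypothetical dict {"session_id": str, "seq": int}
    left behind by the previous shard with the same id / shard total.
    """
    if stored_session is not None:
        # "Steal" the old session: no full connect-and-identify needed.
        return {
            "op": RESUME_OP,
            "d": {
                "token": token,
                "session_id": stored_session["session_id"],
                "seq": stored_session["seq"],
            },
        }
    # Nothing to steal, so fall back to a full IDENTIFY.
    # Other required IDENTIFY fields (properties, etc.) are omitted here.
    return {
        "op": IDENTIFY_OP,
        "d": {"token": token, "shard": [shard_id, shard_total]},
    }


stolen = first_payload("my-token", 0, 2, {"session_id": "abc", "seq": 42})
fresh = first_payload("my-token", 1, 2)
```

Since the stored session is just data, it does not matter which node the replacement shard starts on, as long as it can read what the old shard saved.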
G does not store its Discord entity cache in-memory. Caching is done externally with Redis. This is currently mandatory but will likely become optional someday. Ideally we could run entirely without cache - and the recent lazy guild changes make this possible for a lot of use cases, but my applications need some amount of cache that can't be faked without a lot of 429s.
Redis cache keys are stored as hashes:
$ indicates a variable
guild - g:cache:$shard:guild $id data
channel - stored on the guild object
role - stored on the guild object
member - g:cache:$shard:guild:$id:member $id data
user - g:cache:user $id data
Note that the user data is NOT stored in a per-shard cache. This is to avoid user duplication (which can easily be millions of users). Guilds / channels / roles / members are not global data (as a guild the bot is in is guaranteed to be on exactly 1 shard), so those are stored as per-shard caches. If a shard has to do a full IDENTIFY, existing data cached for that shard is deleted.
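To make the key layout concrete, here is a small Python sketch that builds the keys exactly as listed above and picks out the keys belonging to one shard, as would be needed when clearing its cache after a full IDENTIFY. The helper names and the glob-based selection are assumptions for illustration; how G actually deletes the data is not specified here.

```python
from fnmatch import fnmatchcase

USER_KEY = "g:cache:user"  # global hash, shared by every shard


def guild_key(shard):
    """Hash holding the guilds for one shard (field = guild id)."""
    return f"g:cache:{shard}:guild"


def member_key(shard, guild_id):
    """Hash holding the members of one guild (field = member id)."""
    return f"g:cache:{shard}:guild:{guild_id}:member"


def keys_to_clear(all_keys, shard):
    """Select the per-shard keys to drop when that shard re-IDENTIFYs."""
    pattern = f"g:cache:{shard}:*"
    return [key for key in all_keys if fnmatchcase(key, pattern)]


existing = [
    guild_key(1),
    member_key(1, 487256022529037787),
    guild_key(10),
    USER_KEY,
]
doomed = keys_to_clear(existing, 1)
```

The trailing colon after the shard number in the pattern keeps shard 1 from matching shard 10, and the global user hash is never selected because it has no shard segment.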
Example data:
// Guild example
// Stored at g:cache:1:guild 487256022529037787
{
"id": 487256022529037787,
"name": "my cool guild",
// ...
"channels": [
{
"id": 487256022529037787,
"name": "general",
// ...
}
],
"roles": [
{
"id": 487256022529037787,
"name": "everyone",
// ...
}
]
}
// Member example
// Stored at g:cache:1:guild:487256022529037787:members 477256022529037786
{
"id": 477256022529037786,
"nick": null,
// ...
}
// User example
// Stored at g:cache:users 477256022529037786
{
"id": 477256022529037786,
"username": "my cool user",
"discriminator": "1234"
}
A dedicated external process for caching is problematic because the cache can quickly become desynchronized from the real state. However, passing state between shard instances is a huge pain: a single shard's cache can easily be 200MB+, and memory usage grows very quickly as you add shards. Instead, we can store the data externally in Redis, which avoids having to hand state off between shards.
Suppose you have M nodes and a target shard count N. In an ideal world, each node would have exactly N / M shards on it.
However, this isn't always possible. Suppose M = 4 and N = 50; the ideal case is each node having exactly 12.5 shards on it, which is impossible because a shard can't be split across nodes. To solve this, we set a threshold of shards / node, so that the cluster will be approximately balanced. Specifically, the master will try to balance shards across the cluster such that each node will have round(N / M) ± 1 shards on it.
This will only work if you have a shard count and a node count that actually fit within these numbers.
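The arithmetic can be checked with a short Python sketch, taking the per-node target to be the shard count divided by the node count (N / M). The function names are hypothetical, and the rounding mode is an assumption: Python's built-in round() rounds halves to the nearest even number, so 12.5 becomes 12, whereas a round-half-up rule would give 13. Either way, 12 and 13 shards per node both fall inside the allowed band.

```python
def balanced_range(shard_count, node_count):
    """Allowed shards-per-node band: round(N / M) - 1 to round(N / M) + 1."""
    target = round(shard_count / node_count)  # Python rounds 12.5 to 12
    return target - 1, target + 1


def is_balanced(per_node_counts, shard_count):
    """True if every shard is placed and every node is within the band."""
    low, high = balanced_range(shard_count, len(per_node_counts))
    all_placed = sum(per_node_counts) == shard_count
    return all_placed and all(low <= c <= high for c in per_node_counts)


band = balanced_range(50, 4)
even_spread = is_balanced([13, 13, 12, 12], 50)
lopsided = is_balanced([20, 10, 10, 10], 50)
```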
G uses Raindrop for fetching snowflake IDs for many things. Eventually this may be a proper "pluggable" thing.
Voice support is a non-goal for G. Feel free to PR your own implementation, but I won't be making one.
- There is no mock for the Discord API.
- Distributed stuff turns out to be really hard to test.
- The lack of a Discord API mock makes it basically impossible to test a lot of this.
Set these environment variables for each node, and run with mix run --no-halt.
BOT_TOKEN="aslidufkhaulsidf.asdfgaskjdf.asdfgkjuasdfghadsiujyf"
RAINDROP_URL="https://localhost:1234/"
REDIS_HOST="127.0.0.1"
REDIS_PASS="abcxyz123789"
NODE_NAME="my-awesome-node"
GROUP_NAME="my-awesome-group"
NODE_COOKIE="arjkyhgfvbakuwejsgdfbvkuwjsyhgfbckrujeywhgbakerwjfgaekwjufghbckjudeshcybgrvejwhuysdcbva"
G sends events to Q-backed queues in Redis. Queue names are structured as discord:event:event_name, e.g. discord:event:guild_create, discord:event:message_create, ...
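For a backend worker that needs to find the right queue, the naming scheme can be sketched as below. This assumes the event_name part is simply the gateway dispatch event type (for example GUILD_CREATE) lowercased, which matches the examples above but is not otherwise confirmed; queue_name is a hypothetical helper written in Python.

```python
def queue_name(event_type):
    """Map a gateway dispatch event type to its Redis queue name."""
    return f"discord:event:{event_type.lower()}"


names = [queue_name(t) for t in ("GUILD_CREATE", "MESSAGE_CREATE")]
```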
- Option for member chunking
- Detect a ^C of mix run --no-halt?
- Protect against the case of a 0-shard node
- Make external caching optional
- Proper backend event queueing
- Persist session / seqnum in redis
- Proper session stealing
- Zombie shard detection
- Rebalance shards