Skew in traffic load when fetching COMMAND during initialisation
Issue
When loading the commands, RedisClient::Cluster::Command iterates over an array of nodes, calling COMMAND on each, and returns on the first successful response. Because the nodes are always tried in the same order, this can skew load towards the first node when many applications are deployed at once. As COMMAND is flagged @slow in the Redis documentation (https://redis.io/commands/command/), this could create unnecessary spikes in CPU and latency during deployment.
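To illustrate why a fixed iteration order concentrates the load, here is a minimal sketch of a "return on first success" loop. The method name first_responding_node and the block that stands in for sending COMMAND are hypothetical; this is not the library's actual code, only a simulation of the behaviour described above.

```ruby
# Hypothetical stand-in for the loader: try each node in order and return
# the first one that answers. Not the library's actual implementation.
def first_responding_node(nodes)
  nodes.each do |node|
    begin
      return node if yield(node) # the block simulates sending COMMAND
    rescue StandardError
      next # a failing node falls through to the next one
    end
  end
  nil
end

nodes = %w[redis://localhost:7001 redis://localhost:7101 redis://localhost:7201]

# 21 clients booting against a healthy cluster: every node would answer,
# so the first entry in the list ends up serving every COMMAND call.
hits = Hash.new(0)
21.times { hits[first_responding_node(nodes) { |_node| true }] += 1 }
p hits # => {"redis://localhost:7001"=>21}
```

Other nodes are only reached when an earlier node fails, so in the healthy case they never share the initialization load.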
Locally reproducible example
Using bin/console:
x = (0..20).map { RedisClient.cluster(nodes: %w[redis://localhost:7001 redis://localhost:7002 redis://localhost:7003 redis://localhost:7101 redis://localhost:7102 redis://localhost:7103 redis://localhost:7201 redis://localhost:7202 redis://localhost:7203]).new_client }
On the master nodes of the local cluster, run redis-cli -p xxx client list | wc -l (where xxx is the node's port). Using my local setup, I could reproduce the skew in connections (and by extension, COMMAND calls) to the first node.
➜ ~ redis-cli -p 7001 client list | wc -l
24
➜ ~ redis-cli -p 7101 client list | wc -l
3
➜ ~ redis-cli -p 7201 client list | wc -l
3
Proposal
In https://gitlab.com/gitlab-org/gitlab/-/merge_requests/128422, we patched the Redis::Cluster::SlotLoader, Redis::Cluster::NodeLoader, and Redis::Cluster::CommandLoader to perform a .shuffle.
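The idea behind the shuffle can be sketched as follows. This is a simulation under assumptions, not the actual patch: pick_node is a hypothetical helper, the block stands in for a successful COMMAND call, and the seeded Random is only there to make the run repeatable.

```ruby
# Hypothetical loader with the same "first success wins" loop, but walking
# a shuffled copy of the node list. Not the actual patch from the merge request.
def pick_node(nodes, random: Random.new)
  nodes.shuffle(random: random).find { |node| yield(node) }
end

nodes = %w[redis://localhost:7001 redis://localhost:7101 redis://localhost:7201]

# Simulate many clients starting up against a healthy cluster.
rng = Random.new(42) # seeded only so the simulation is repeatable
hits = Hash.new(0)
2100.times { hits[pick_node(nodes, random: rng) { |_node| true }] += 1 }
p hits
```

Each node should receive roughly a third of the simulated COMMAND calls, since every client now starts from a random position instead of always starting at the first node.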
This was able to spread out the initialization load across all nodes in the cluster (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16100#note_1507763635) as seen in the diagram:
Would the maintainers be keen on patching this behaviour? I've opened #239, which gives a more even distribution in the local example:
➜ ~ redis-cli -p 7001 client list | wc -l
10
➜ ~ redis-cli -p 7101 client list | wc -l
8
➜ ~ redis-cli -p 7201 client list | wc -l
12
I appreciate your feedback from the real world. It means a lot to us. I'll review the pull request as soon as possible.