redis-rb/redis-cluster-client

Skew in traffic load when fetching COMMAND during initialisation

Closed · 1 comment

slai11 commented

Issue

When loading the commands, RedisClient::Cluster::Command iterates over an array of nodes, calls COMMAND on each in turn, and returns on the first successful response. Because every client walks the nodes in the same order, the first node receives nearly all of these calls, which skews load when many application instances are deployed at once. As COMMAND is tagged @slow (see https://redis.io/commands/command/), this can create unnecessary spikes in CPU and latency during deployment.
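To make the skew concrete, here is a minimal sketch of the first-success pattern described above. It does not use the library's code: `FakeNode`, `call_command`, and `load_commands` are hypothetical names invented for this illustration, and the node simply counts how often it is asked for COMMAND.

```ruby
# Hypothetical stand-in for a cluster node (not part of redis-cluster-client).
# It only counts how many times COMMAND was requested from it.
FakeNode = Struct.new(:name, :command_calls) do
  def call_command
    self.command_calls += 1
    %w[get set del] # placeholder for a real COMMAND reply
  end
end

# First-success loading: walk the nodes in the given order and
# return the reply of the first node that answers.
def load_commands(nodes)
  nodes.each do |node|
    begin
      return node.call_command
    rescue StandardError
      next # fall through to the next node on failure
    end
  end
  raise 'no node answered COMMAND'
end

nodes = %w[7001 7101 7201].map { |port| FakeNode.new(port, 0) }
21.times { load_commands(nodes) } # 21 clients initialising
nodes.each { |n| puts "#{n.name}: #{n.command_calls}" }
# prints:
# 7001: 21
# 7101: 0
# 7201: 0
```

As long as the first node is healthy, no other node is ever asked, which matches the connection counts in the reproduction below.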

Locally reproducible example

Using bin/console:

nodes = %w[redis://localhost:7001 redis://localhost:7002 redis://localhost:7003 redis://localhost:7101 redis://localhost:7102 redis://localhost:7103 redis://localhost:7201 redis://localhost:7202 redis://localhost:7203]
x = (0..20).map { RedisClient.cluster(nodes: nodes).new_client }

On the master nodes of the local cluster, use redis-cli -p xxx client list | wc -l. Using my local setup, I could reproduce the skew in connections (and by extension, COMMAND calls) to the first node.

➜  ~ redis-cli -p 7001 client list | wc -l
      24
➜  ~ redis-cli -p 7101 client list | wc -l
       3
➜  ~ redis-cli -p 7201 client list | wc -l
       3

Proposal

In https://gitlab.com/gitlab-org/gitlab/-/merge_requests/128422, we patched Redis::Cluster::SlotLoader, Redis::Cluster::NodeLoader, and Redis::Cluster::CommandLoader to call .shuffle on the list of nodes before iterating.
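The idea behind the patch can be sketched as follows. This is an illustration under assumptions, not the code from the merge request or from #239: `CountingNode`, `call_command`, and `load_commands_shuffled` are hypothetical names, and the optional `rng:` parameter exists only so the example can be seeded.

```ruby
# Hypothetical stand-in for a cluster node (not part of any Redis library).
CountingNode = Struct.new(:name, :command_calls) do
  def call_command
    self.command_calls += 1
    %w[get set del] # placeholder for a real COMMAND reply
  end
end

# Same first-success loop, but over a shuffled copy of the node list,
# so each client is likely to start with a different node.
def load_commands_shuffled(nodes, rng: Random.new)
  nodes.shuffle(random: rng).each do |node|
    begin
      return node.call_command
    rescue StandardError
      next # fall through to the next node on failure
    end
  end
  raise 'no node answered COMMAND'
end

nodes = %w[7001 7101 7201].map { |port| CountingNode.new(port, 0) }
21.times { load_commands_shuffled(nodes) } # 21 clients initialising
nodes.each { |n| puts "#{n.name}: #{n.command_calls}" }
```

The per-node counts vary from run to run, but the 21 calls are now spread across the nodes instead of all landing on the first one. `Array#shuffle` returns a new array, so the caller's node list keeps its original order.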

This was able to spread out the initialization load across all nodes in the cluster (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16100#note_1507763635) as seen in the diagram:

[Image: Screenshot 2023-08-13 at 8 16 39 PM]

Would the maintainers be keen on patching this behaviour? I've opened #239, which gives a more even distribution in the local example:

➜  ~ redis-cli -p 7001 client list | wc -l
      10
➜  ~ redis-cli -p 7101 client list | wc -l
       8
➜  ~ redis-cli -p 7201 client list | wc -l
      12

I appreciate your feedback from the real world. It means a lot to us. I'll review the pull request as soon as possible.