Handling Connection Spikes in WebSocket Server with Redis

Question

Closed this issue 6 months ago · 3 comments

Hi everyone,

I'm seeking some guidance on how to best manage connection spikes in my WebSocket server application, which heavily relies on Redis PubSub, Sorted Sets, and Lists.

Problem Description:

Our application experiences intermittent but significant spikes in user activity throughout the day. These spikes lead to a large number of concurrent connection initializations in our Redis connection pool. This, in turn, results in:

context deadline exceeded errors.
Slow Redis command execution times.

Current Mitigation Strategy:

I've attempted to address this by increasing the MinIdle setting in our Redis connection pool. Previously, with a lower MinIdle (around 1000), we encountered frequent connection-related errors during peak usage. After raising MinIdle to 3000, these errors dropped significantly (to under 100).

Setup Details:

Redis Server: AWS ElastiCache Redis (version 7+) with cluster mode enabled.
Application Instances: 4 EC2 instances running the WebSocket server application.
Redis Connection Pool Configuration (per application instance):

PoolSize - 5000
MinIdle - 3000
MaxIdle - 3500
connIdleTimeout - 9h
connLifetime - 12h
readTimeout - 10s
writeTimeout - 10s
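
To put these numbers in perspective, here is a rough sketch of the fleet-wide connection totals the settings imply. The PoolConfig type, the FleetConnections helper, and the nodes parameter are hypothetical names used only for illustration. Some cluster clients keep one pool per cluster node, so nodes is an assumed multiplier for that case; it is set to 1 here because the node count is not stated above.

```go
package main

import "fmt"

// PoolConfig mirrors the per-instance pool settings listed above.
// The type itself is hypothetical and not part of any client library.
type PoolConfig struct {
	PoolSize int
	MinIdle  int
	MaxIdle  int
}

// FleetConnections returns the idle floor and the peak ceiling of
// connections for the whole fleet. nodes is an assumed multiplier for
// clients that keep a separate pool per cluster node; pass 1 otherwise.
func FleetConnections(cfg PoolConfig, instances, nodes int) (idleFloor, ceiling int) {
	idleFloor = cfg.MinIdle * instances * nodes
	ceiling = cfg.PoolSize * instances * nodes
	return idleFloor, ceiling
}

func main() {
	cfg := PoolConfig{PoolSize: 5000, MinIdle: 3000, MaxIdle: 3500}
	idle, peak := FleetConnections(cfg, 4, 1)
	fmt.Printf("idle floor: %d, peak ceiling: %d\n", idle, peak)
	// prints "idle floor: 12000, peak ceiling: 20000"
}
```

If the pool does turn out to be per node, the same call with the real node count gives the total to compare against the server's connection limit.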

Question:

While increasing MinIdle has provided some relief, I'm wondering if this is the most efficient or recommended approach. Are there alternative strategies or configurations I should consider to better handle these connection spikes and ensure the stability and performance of our application? Any insights or suggestions would be greatly appreciated!

Answer 1 · 2025-04-29T09:59:32.000Z

Hello @akshaykhairmode, can you share more about your setup? What context are you passing to the clients, and why is the context timing out?

Answer 2 · 2025-04-29T10:16:31.000Z

Hi @ndyakov, the context passed is context.Background().

I have increased the pool's minIdle to 3500 and maxIdle to 3800, and set maxIdleTime and maxLifetime to -1 to disable those checks.

I no longer see context deadline exceeded errors, but the slowness remains; I have reported it in #3359.

Answer 3 · 2025-05-01T07:12:41.000Z

Continued discussion in the issue from the comment above. Closing this one.