StackExchange/StackExchange.Redis

Timeout exception connecting to Azure Cache for Redis from App Service

llopezalonso opened this issue · 5 comments

We are experiencing timeouts with Redis when 10k requests are sent from an App Service to Redis.

Azure resources:

  • Azure Cache for Redis is Premium P2
  • AppService P2V3 (2 instances)

AppService Code:

  • .NET 8
  • Using package StackExchange.Redis v2.7.33
  • ThreadPool.SetMinThreads(256, 256);
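For reference, our startup code is roughly the following (a simplified sketch of the usual lazy-singleton pattern; the class name and connection string are placeholders, not our real configuration):

```csharp
using System;
using System.Threading;
using StackExchange.Redis;

public static class RedisConnection
{
    // Raise the minimum thread counts before any load arrives, so a burst of
    // requests doesn't have to wait for thread-pool ramp-up.
    static RedisConnection() => ThreadPool.SetMinThreads(256, 256);

    // One shared ConnectionMultiplexer for the whole app, created lazily on first use.
    private static readonly Lazy<ConnectionMultiplexer> Connection =
        new(() => ConnectionMultiplexer.Connect(
            "XXXX.redis.cache.windows.net:6380,ssl=true,abortConnect=false,password=<secret>"));

    public static IDatabase Database => Connection.Value.GetDatabase();
}
```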

Error shown in AppInsights:

Timeout awaiting response (outbound=67648KiB, inbound=328KiB, 5250ms elapsed, timeout is 5000ms), command=HGET, next: HGET MICLAVEBBDD, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 65536, last-in: 14161, cur-in: 8719, sync-ops: 0, async-ops: 10008, serverEndpoint: XXXX.redis.cache.windows.net:6380, conn-sec: 385.31, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: wnRURURU0000D9(SE.Redis-v2.8.0.27420), IOCP: (Busy=0,Free=1000,Min=256,Max=1000), WORKER: (Busy=21,Free=32746,Min=256,Max=32767), POOL: (Threads=80,QueuedItems=835,CompletedItems=20361,Timers=17), v: 2.8.0.27420 (Please take a look at this article for some common client-side issues that can cause timeouts:
https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Any thoughts that could help us?

Thanks a ton

We have the same challenges on the same stack, running on Azure App Service with .NET 6 (and now 8) and Redis. After almost 7 years we have learned the following:

  • CPU pressure is real. When our average CPU goes above 80%, we start noticing Redis issues.
  • Set up private endpoints to bypass SNAT throttling. Connect to your Redis, SQL, Cosmos DB, etc. with a private endpoint; without one, your app communicates over the shared public network, so you are limited. Check for SNAT port exhaustion in the "Diagnose and solve problems" tab.
  • We applied the "best practices" from here. We even use that "simple" retry handling, because the pre-v8 Polly library added a lot of CPU and memory overhead when we benchmarked it with BenchmarkDotNet.
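To be concrete, the retry handling we mean is nothing heavier than a small helper along these lines (a sketch, not the exact code from that article; the attempt count and delay are arbitrary):

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class RedisRetry
{
    // Retry only transient Redis failures a few times with a short, growing delay,
    // instead of pulling in a full policy library.
    public static async Task<T> ExecuteAsync<T>(Func<Task<T>> action, int maxAttempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (Exception ex) when (attempt < maxAttempts &&
                                       (ex is RedisConnectionException || ex is RedisTimeoutException))
            {
                await Task.Delay(TimeSpan.FromMilliseconds(100 * attempt));
            }
        }
    }
}
```

Used as, for example, `var value = await RedisRetry.ExecuteAsync(() => db.StringGetAsync(key));`.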

Some low-hanging fruit:

  • Big keys - instead of requesting tens of thousands of keys in a single MGET, we started doing "paged" MGETs (see the sketch after this list).
  • Make sure your Redis instance isn't too busy evicting keys.
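The "paged" MGET mentioned above is essentially this (a sketch; the page size of 500 is arbitrary and worth tuning against your value sizes):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class PagedGet
{
    // Split one huge MGET into smaller batches so no single reply has to
    // carry tens of thousands of values at once.
    public static async Task<Dictionary<string, RedisValue>> GetPagedAsync(
        IDatabase db, IReadOnlyList<string> keys, int pageSize = 500)
    {
        var result = new Dictionary<string, RedisValue>(keys.Count);
        for (var i = 0; i < keys.Count; i += pageSize)
        {
            var page = keys.Skip(i).Take(pageSize).ToArray();
            var values = await db.StringGetAsync(page.Select(k => (RedisKey)k).ToArray());
            for (var j = 0; j < page.Length; j++)
            {
                result[page[j]] = values[j];
            }
        }
        return result;
    }
}
```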

Thanks a ton for your response and your insights.

Regarding your comments:

  • We do not see CPU pressure on the App Service (with 10k requests, our App Service sits at about 40% CPU)
  • We are currently using private endpoints to access Redis (public access is not enabled)
  • We do not handle retries; our code to connect to Redis is quite simple (we have followed this post)

Regarding the other comments:

  • We are making 10k requests through Azure Load Testing; the App Service code retrieves a single value through HashGetAsync, so I can't apply pagination (see the sketch after this list) :(
  • The EvictedKeys metric is always 0
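For completeness, each request in the test boils down to essentially this and nothing more (simplified; the service wrapper and field name are placeholders, the hash key is the one from the error above):

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

public class LookupService
{
    private readonly IDatabase _db;

    public LookupService(IDatabase db) => _db = db;

    // A single HGET per incoming request: one field from one hash,
    // so there is no multi-key batch to page.
    public Task<RedisValue> GetValueAsync(string field) =>
        _db.HashGetAsync("MICLAVEBBDD", field);
}
```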

Again, thanks a ton for your comments and ideas; we are stuck on this problem :(

What kind of sends are involved here? On the outbound we see 67648KiB, which is quite a bit queued on the outbound side of the socket - is this storing very large keys?

Thanks for your response @NickCraver

No, we are not storing large keys (approx. 13k values in the database, and the largest is 15 KB)

@llopezalonso The outbound was 67648KiB, so even assuming a 15 KiB upper bound, that's roughly 4,500 keys (67648 / 15) being stored at once as a spike in traffic - does that sound like the intended behavior, or does it fail a gut check?