StackExchange/StackExchange.Redis

Timeout exception connecting to Azure Cache for Redis from App Service

llopezalonso opened this issue · 5 comments

We are experiencing timeouts with Redis when 10k requests are sent from an App Service to Redis.

Azure resources:

  • Azure Cache for Redis is Premium P2
  • AppService P2V3 (2 instances)

AppService Code:

  • .NET 8
  • Using package StackExchange.Redis v2.7.33
  • ThreadPool.SetMinThreads(256, 256);
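For reference, our startup code is roughly the following (a simplified sketch of the usual lazy-singleton pattern; the class name and connection string are placeholders, not our real configuration):

```csharp
using System;
using System.Threading;
using StackExchange.Redis;

public static class RedisConnection
{
    // Raise the minimum thread counts before any load arrives, so a burst of
    // requests doesn't have to wait for thread-pool ramp-up.
    static RedisConnection() => ThreadPool.SetMinThreads(256, 256);

    // One shared ConnectionMultiplexer for the whole app, created lazily on first use.
    private static readonly Lazy<ConnectionMultiplexer> Connection =
        new(() => ConnectionMultiplexer.Connect(
            "XXXX.redis.cache.windows.net:6380,ssl=true,abortConnect=false,password=<secret>"));

    public static IDatabase Database => Connection.Value.GetDatabase();
}
```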

Error shown in AppInsights:

Timeout awaiting response (outbound=67648KiB, inbound=328KiB, 5250ms elapsed, timeout is 5000ms), command=HGET, next: HGET MICLAVEBBDD, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 65536, last-in: 14161, cur-in: 8719, sync-ops: 0, async-ops: 10008, serverEndpoint: XXXX.redis.cache.windows.net:6380, conn-sec: 385.31, aoc: 0, mc: 1/1/0, mgr: 9 of 10 available, clientName: wnRURURU0000D9(SE.Redis-v2.8.0.27420), IOCP: (Busy=0,Free=1000,Min=256,Max=1000), WORKER: (Busy=21,Free=32746,Min=256,Max=32767), POOL: (Threads=80,QueuedItems=835,CompletedItems=20361,Timers=17), v: 2.8.0.27420 (Please take a look at this article for some common client-side issues that can cause timeouts:
https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Any thoughts that could help us?

Thanks a ton

We have the same challenges on the same stack, running on Azure App Service with .NET 6 (and now 8) and Redis. After almost 7 years we have learned the following:

  • CPU pressure is real. When our average CPU goes above 80%, we start noticing Redis issues.
  • Set up private endpoints to bypass SNAT throttling. Connect to your Redis, SQL, Cosmos DB, etc. with a private endpoint; without one, your app communicates over the shared public network, so you are limited. Check for SNAT port exhaustion in the "Diagnose and solve problems" tab.
  • We applied the "best practices" from here. We even use that "simple" retry handling, because the pre-v8 Polly library added a lot of CPU and memory overhead when we benchmarked it with BenchmarkDotNet.
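To be concrete, the retry handling we mean is nothing heavier than a small helper along these lines (a sketch, not the exact code from that article; the attempt count and delay are arbitrary):

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class RedisRetry
{
    // Retry only transient Redis failures a few times with a short, growing delay,
    // instead of pulling in a full policy library.
    public static async Task<T> ExecuteAsync<T>(Func<Task<T>> action, int maxAttempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (Exception ex) when (attempt < maxAttempts &&
                                       (ex is RedisConnectionException || ex is RedisTimeoutException))
            {
                await Task.Delay(TimeSpan.FromMilliseconds(100 * attempt));
            }
        }
    }
}
```

Used as, for example, `var value = await RedisRetry.ExecuteAsync(() => db.StringGetAsync(key));`.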

Some low-hanging fruit:

  • Big keys - instead of requesting tens of thousands of keys in a single MGET, we started doing "paged" MGETs (see the sketch after this list).
  • Make sure your Redis instance isn't too busy evicting keys.
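The "paged" MGET mentioned above is essentially this (a sketch; the page size of 500 is arbitrary and worth tuning against your value sizes):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class PagedGet
{
    // Split one huge MGET into smaller batches so no single reply has to
    // carry tens of thousands of values at once.
    public static async Task<Dictionary<string, RedisValue>> GetPagedAsync(
        IDatabase db, IReadOnlyList<string> keys, int pageSize = 500)
    {
        var result = new Dictionary<string, RedisValue>(keys.Count);
        for (var i = 0; i < keys.Count; i += pageSize)
        {
            var page = keys.Skip(i).Take(pageSize).ToArray();
            var values = await db.StringGetAsync(page.Select(k => (RedisKey)k).ToArray());
            for (var j = 0; j < page.Length; j++)
            {
                result[page[j]] = values[j];
            }
        }
        return result;
    }
}
```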

Thanks a ton for your response and your insights.

Regarding your comments:

  • We do not see CPU pressure on the App Service (with 10k requests, our App Service sits at about 40% CPU)
  • We are currently using private endpoints to access Redis (public access is not enabled)
  • We do not handle retries; our code to connect to Redis is quite simple (we have followed this post)

Regarding the other comments:

  • We are making 10k requests through Azure Load Testing; the App Service code retrieves a single value through HashGetAsync, so I can't apply pagination (see the sketch after this list) :(
  • The EvictedKeys metric is always 0
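For completeness, each request in the test boils down to essentially this and nothing more (simplified; the service wrapper and field name are placeholders, the hash key is the one from the error above):

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

public class LookupService
{
    private readonly IDatabase _db;

    public LookupService(IDatabase db) => _db = db;

    // A single HGET per incoming request: one field from one hash,
    // so there is no multi-key batch to page.
    public Task<RedisValue> GetValueAsync(string field) =>
        _db.HashGetAsync("MICLAVEBBDD", field);
}
```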

Again, thanks a ton for your comments and ideas; we are stuck on this problem :(

What kind of sends are involved here? On the outbound we see 67648KiB, which is quite a bit queued on the outbound side of the socket - is this storing very large keys?

Thanks for your response @NickCraver

No, we are not storing large keys (approx. 13k values in the database, and the largest is 15 KB)

@llopezalonso The outbound was 67648KiB, so even assuming a 15 KiB upper bound, that's roughly 4,500 keys (67648 / 15) being stored at once as a spike in traffic - does that sound like the intended behavior, or does it fail a gut check?