StackExchange/StackExchange.Redis

What could cause the different peaks in Server Load, Processor Time, and CPU usage?

maomaomqiu opened this issue · 6 comments

Hi all, many thanks for the previous answers. I am working on a purge task that clears a persistent Redis region, and when I trigger the task I notice a difference in Server Load, Processor Time, and CPU usage between regions.

Region A

Configuration

Premium 26 GB (2 × 13 GB)

Dashboards

  • the peak indicates the trigger time
    [dashboard screenshot]
    there is no obvious increase in Server Load, Processor Time, or CPU usage during that period

  • I triggered the task 3 times; the dashboards look similar every time

Volume

  • Total keys ~8 million; the purge task clears ~7 million keys

Condition

  • operations per second are 1.3-1.4k before the trigger
  • get operations are slightly higher than set operations
    [operations screenshot]

Region B

Configuration

Premium 26 GB (2 × 13 GB)

Dashboards

  • the peak indicates the trigger time
    [dashboard screenshot]

Volume

  • Total keys ~6.5 million; the purge task clears ~5.5 million keys
  • the key distributions of Region A and Region B are similar

Condition

  • operations per second are 1.8-1.9k before the trigger
  • get operations are about double the set operations
    [operations screenshot]

Purge Task Logic

// Key patterns that need to be fetched from Redis
string[] patterns;

using (ConnectionMultiplexer connection = await ConnectionMultiplexer.ConnectAsync(config))
{
    // filter and check the servers, then pick one (selection logic omitted)
    IServer server = connection.GetServers().First();
    List<Task> tasks = new List<Task>(patterns.Length);

    foreach (var pattern in patterns)
    {
        tasks.Add(RedisPersistentKeyPurgeAction(connection, server, pattern));
    }
    await Task.WhenAll(tasks).ConfigureAwait(false);
}

private async Task RedisPersistentKeyPurgeAction(ConnectionMultiplexer connection, IServer server, string pattern)
{
    // keys buffered for a batch EXPIRE call
    List<string> batchExpireBuffer = new List<string>(50);

    var db = connection.GetDatabase();
    await using var keys = server.KeysAsync(pattern: pattern).GetAsyncEnumerator();
    bool isLastKey = !await keys.MoveNextAsync();

    while (!isLastKey)
    {
        // every 100 processed keys there is a sleep (throttling; counting logic omitted)
        await Task.Delay(200);

        // every 50 keys, or once the last matched key is reached, batch-set the default time to live on those persistent keys
        if (batchExpireBuffer.Count == 50 || isLastKey)
        {
            await BatchSetExpiry(batchExpireBuffer, expiry, db);
            batchExpireBuffer.Clear();
        }

        if (batchExpireBuffer.Count < 50)
        {
            batchExpireBuffer.Add(keys.Current.ToString());
        }

        // other iterator logic omitted (advance the enumerator, update isLastKey, ...)
    }
}

private async Task BatchSetExpiry(List<string> setExpiryList, int expiry, IDatabase db)
{
    IBatch batch = db.CreateBatch();
    var expireTasks = new List<Task>(setExpiryList.Count);

    foreach (var key in setExpiryList)
    {
        // expiry is the default expire time, 12 hours (in seconds)
        expireTasks.Add(batch.KeyExpireAsync(key, TimeSpan.FromSeconds(expiry)));
    }

    // Execute() sends the queued commands; await the per-key tasks to observe completion
    batch.Execute();
    await Task.WhenAll(expireTasks);

    // other logic omitted, e.g. exception handling
}
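
For context, as far as I understand, server.KeysAsync iterates the keyspace with SCAN, and its pageSize argument is used as the SCAN COUNT hint, so a smaller page spreads the scan over more, lighter round trips. Below is a minimal sketch of passing an explicit page size with a per-page pause; the pageSize and delay values are illustrative assumptions, not the ones used in my task.

// Minimal sketch: enumerate matching keys with an explicit SCAN page size,
// pausing between pages to keep server load flat. The pageSize and delay
// values here are illustrative, not tuned recommendations.
private async Task ThrottledScanAsync(IServer server, string pattern)
{
    int processed = 0;

    await foreach (var key in server.KeysAsync(pattern: pattern, pageSize: 100))
    {
        // ... buffer the key for a batch EXPIRE, as in the purge task above ...

        if (++processed % 100 == 0)
        {
            await Task.Delay(200);
        }
    }
}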

I wonder, do you have any idea what could cause the difference?

It seems like this is a server-side question really, not a client one. Is the client doing anything incorrect here? I'm reading your question as "why didn't the first server have the same impact?" - there are many reasons if that's the case, from shard counts to SKU sizes, etc. - we can't really speak to server impact here because that's widely variable depending on the hosting setup, replication, latency, concurrent load, etc.

If you can repro bad load patterns, it'd be best to engage the hosting team here to pose that question. If I'm missing a client-side question though: please clarify, happy to answer.

Thanks @NickCraver , I can repro.
The bad load pattern is tolerable, since the peak lasts only about 1 minute.
I just wonder about the possible root cause, so that if a similar operation is needed again I can avoid the peak.

[dashboard screenshot]

Each peak represents a trigger

PS: I found that the peaks have nothing to do with the total number of keys matched

Hi @NickCraver , could you suggest how to engage the hosting team? Many thanks in advance!

@maomaomqiu Please engage support via 'Support + Troubleshoot' on the cache in the Portal

Thanks for the reply, do you mean the Azure portal? @philon-msft