stefanprodan/AspNetCoreRateLimit

Distributed cache concurrency issue

Caskia opened this issue · 34 comments

Hi @stefanprodan,
When deploying many API services behind a load balancer (using a distributed cache to store counters), every instance needs to get the counter and use an in-process lock to increment the count, as in the code below:

using (await AsyncLock.WriterLockAsync(counterId).ConfigureAwait(false))
{
    var entry = await _counterStore.GetAsync(counterId, cancellationToken);

    if (entry.HasValue)
    {
        // entry has not expired
        if (entry.Value.Timestamp + rule.PeriodTimespan.Value >= DateTime.UtcNow)
        {
            // increment request count
            var totalCount = entry.Value.Count + _config.RateIncrementer?.Invoke() ?? 1;

            // deep copy
            counter = new RateLimitCounter
            {
                Timestamp = entry.Value.Timestamp,
                Count = totalCount
            };
        }
    }

    // stores: id (string) - timestamp (datetime) - total_requests (long)
    await _counterStore.SetAsync(counterId, counter, rule.PeriodTimespan.Value, cancellationToken);
}

I think this has a concurrency issue: the counter will not work correctly when many requests come in. Each instance increments its own copy of the counter and then saves it to the cache store, so the last write overwrites the previous ones (e.g. two instances both read a count of 5 and both write back 6, losing a request).

Why not use the Token Bucket algorithm to implement this feature?

Hi @Caskia,

Your observation is correct. How would you implement the Token Bucket algorithm in this particular case?

Please feel free to contribute details or pseudo-code, or open a new pull request.

Thanks

The error:

[2019-09-05T12:08:12.6390870+00:00  Microsoft.AspNetCore.Server.Kestrel ERR] Connection id "0HLPFG8HQJTUG", Request id "0HLPFG8HQJTUG:00000003": An unhandled exception was thrown by the application.
System.OperationCanceledException: The operation was canceled.
   at System.Threading.CancellationToken.ThrowOperationCanceledException()
   at Microsoft.Extensions.Caching.Redis.RedisCache.RefreshAsync(String key, Nullable`1 absExpr, Nullable`1 sldExpr, CancellationToken token)
   at Microsoft.Extensions.Caching.Redis.RedisCache.GetAndRefreshAsync(String key, Boolean getData, CancellationToken token)
   at Microsoft.Extensions.Caching.Redis.RedisCache.GetAsync(String key, CancellationToken token)
   at Microsoft.Extensions.Caching.Distributed.DistributedCacheExtensions.GetStringAsync(IDistributedCache cache, String key, CancellationToken token)
   at AspNetCoreRateLimit.DistributedCacheRateLimitStore`1.GetAsync(String id, CancellationToken cancellationToken)
   at AspNetCoreRateLimit.IpRateLimitProcessor.GetMatchingRulesAsync(ClientRequestIdentity identity, CancellationToken cancellationToken)
   at AspNetCoreRateLimit.RateLimitMiddleware`1.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)

Is it coming from this issue?

Thanks for the report.

@cristipufu Hi, could you check the error message, please?

There's a workaround for the distributed cache issue, but it requires a little bit of coding (be aware that this is only a concept and it's not tested):

Override the rate-limiting middleware:

DistributedClientRateLimitMiddleware.cs
public static IApplicationBuilder UseDistributedClientRateLimiting(this IApplicationBuilder builder)
{
    return builder.UseMiddleware<DistributedClientRateLimitMiddleware>();
}

public class DistributedClientRateLimitMiddleware : RateLimitMiddleware<DistributedClientRateLimitProcessor>
{
    private readonly ILogger<DistributedClientRateLimitMiddleware> _logger;

    public DistributedClientRateLimitMiddleware(RequestDelegate next,
            IOptions<ClientRateLimitOptions> options,
            IRateLimitCounterStore counterStore,
            IClientPolicyStore policyStore,
            IRateLimitConfiguration config,
            IRedLockFactory redLockFactory,
            ILogger<DistributedClientRateLimitMiddleware> logger)
        : base(next, options?.Value, new DistributedClientRateLimitProcessor(options?.Value, counterStore, policyStore, config, redLockFactory), config)
    {
        _logger = logger;
    }

    protected override void LogBlockedRequest(HttpContext httpContext, ClientRequestIdentity identity, RateLimitCounter counter, RateLimitRule rule)
    {
        _logger.LogInformation($"Request {identity.HttpVerb}:{identity.Path} from ClientId {identity.ClientId} has been blocked, quota {rule.Limit}/{rule.Period} exceeded by {counter.Count}. Blocked by rule {rule.Endpoint}, TraceIdentifier {httpContext.TraceIdentifier}.");
    }
}

Now we just have to override the ClientRateLimitProcessor.ProcessRequestAsync() method, replacing the AsyncLock in-process lock with a Redis distributed lock:

DistributedClientRateLimitProcessor.cs
public class DistributedClientRateLimitProcessor : ClientRateLimitProcessor
{
    private readonly IRateLimitCounterStore _counterStore;
    private readonly IRateLimitConfiguration _config;
    private readonly IRedLockFactory _redLockFactory;

    public DistributedClientRateLimitProcessor(ClientRateLimitOptions options,
        IRateLimitCounterStore counterStore,
        IClientPolicyStore policyStore,
        IRateLimitConfiguration config,
        IRedLockFactory redLockFactory)
        : base(options, counterStore, policyStore, config)
    {
        _counterStore = counterStore;
        _config = config;
        _redLockFactory = redLockFactory; // this should be injected as a singleton
    }

    public override async Task<RateLimitCounter> ProcessRequestAsync(ClientRequestIdentity requestIdentity, RateLimitRule rule, CancellationToken cancellationToken = default)
    {
        var counter = new RateLimitCounter
        {
            Timestamp = DateTime.UtcNow,
            Count = 1
        };

        var counterId = BuildCounterKey(requestIdentity, rule);
        var expiry = TimeSpan.FromSeconds(2);

        // serialize reads and writes on the same key across all instances
        using (var redLock = await _redLockFactory.CreateLockAsync(counterId, expiry)) // there are also non-async Create() methods
        {
            // make sure we got the lock
            if (redLock.IsAcquired)
            {
                var entry = await _counterStore.GetAsync(counterId, cancellationToken);

                if (entry.HasValue)
                {
                    // entry has not expired
                    if (entry.Value.Timestamp + rule.PeriodTimespan.Value >= DateTime.UtcNow)
                    {
                        // increment request count
                        var totalCount = entry.Value.Count + _config.RateIncrementer?.Invoke() ?? 1;

                        // deep copy
                        counter = new RateLimitCounter
                        {
                            Timestamp = entry.Value.Timestamp,
                            Count = totalCount
                        };
                    }
                }

                // stores: id (string) - timestamp (datetime) - total_requests (long)
                await _counterStore.SetAsync(counterId, counter, rule.PeriodTimespan.Value, cancellationToken);
            }
            // note: if the lock was not acquired, the request falls through with a fresh counter
        }
        // the lock is automatically released at the end of the using block

        return counter;
    }
}

One can implement a distributed lock mechanism on top of Redis, and there are already implementations out there that we could use or at least take inspiration from (e.g. RedLock.net, whose lock factory the sketch above assumes).
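On the "singleton injected" note above: a minimal registration sketch, assuming RedLock.net (where the factory interface is actually named IDistributedLockFactory, standing in for the IRedLockFactory used in the concept code):

// Startup.ConfigureServices (sketch). Assumes the RedLock.net package:
// using RedLockNet; using RedLockNet.SERedis; using RedLockNet.SERedis.Configuration;
// using StackExchange.Redis;
services.AddSingleton<IDistributedLockFactory>(provider =>
    RedLockFactory.Create(new List<RedLockMultiplexer>
    {
        // one multiplexer per Redis endpoint, reused for the app's lifetime
        new RedLockMultiplexer(ConnectionMultiplexer.Connect("localhost"))
    }));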

Then we just have to replace the IRateLimitCounterStore implementation with the distributed one (available in the AspNetCoreRateLimit package) and also specify which distributed cache implementation we want to use:

services.AddSingleton<IRateLimitCounterStore, DistributedCacheRateLimitCounterStore>();
services.AddStackExchangeRedisCache(options =>
{
    options.Configuration = "localhost";
    options.InstanceName = "SampleInstance";
});
services.AddDistributedRedisCache(options =>
{
    // e.g. contoso5.redis.cache.windows.net,abortConnect=false,ssl=true,password=...
    options.Configuration = Configuration["IpRateLimitingRedisString"];
    options.InstanceName = "ip_rate_limit_";
});

// needed to store rate limit counters and ip rules
services.AddMemoryCache();

// load general configuration from appsettings.json
services.Configure<IpRateLimitOptions>(Configuration.GetSection("IpRateLimiting"));

// load ip rules from appsettings.json
services.Configure<IpRateLimitPolicies>(Configuration.GetSection("IpRateLimitPolicies"));

// inject counter and rules stores
services.AddSingleton<IIpPolicyStore, DistributedCacheIpPolicyStore>();
services.AddSingleton<IRateLimitCounterStore, DistributedCacheRateLimitCounterStore>();

@cristipufu Like this?

But it does not work; the distributed cache concurrency problem is still there.

There's also another possible solution (again overriding the ClientRateLimitProcessor.ProcessRequestAsync() method): one could implement the pattern described here: https://redis.io/commands/INCR#pattern-rate-limiter
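For reference, a minimal sketch of that pattern using StackExchange.Redis (the class and method names here are illustrative, not part of the library):

using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public class RedisIncrRateLimiter
{
    private readonly IDatabase _db;

    public RedisIncrRateLimiter(IConnectionMultiplexer redis) => _db = redis.GetDatabase();

    // Returns the request count for the current window. INCR is atomic on the
    // Redis server, so concurrent instances never lose updates.
    public async Task<long> IncrementAsync(string counterId, TimeSpan period)
    {
        var count = await _db.StringIncrementAsync(counterId);

        if (count == 1)
        {
            // first request of this window: start the window's expiry
            await _db.KeyExpireAsync(counterId, period);
        }

        return count;
    }
}

Note the caveat raised further down in this thread: INCR and EXPIRE are two separate calls here, so a crash between them can leave a counter without an expiry; the fully atomic variant uses a Lua script.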

Just wanted to throw in some more information: I ran this plugin on an API that was getting about 1 million requests per 4 hours. After getting complaints about sluggishness during peak times, I removed this limiter and went from about 0.6 sec per request down to 0.15 sec per request. I used a Redis backend for the cache as well. So there is definitely something that needs to be done to help performance under high load. Otherwise, for small projects this plugin totally rocks.

@STRATZ-Ken were you using the 3+ version? (The previous versions had a simple lock, which meant that each request would have throttled all the others.)

The issue is probably this block of code:

using (await AsyncLock.WriterLockAsync(counterId).ConfigureAwait(false))
{
    var entry = await _counterStore.GetAsync(counterId, cancellationToken);

    if (entry.HasValue)
    {
        // entry has not expired
        if (entry.Value.Timestamp + rule.PeriodTimespan.Value >= DateTime.UtcNow)
        {
            // increment request count
            var totalCount = entry.Value.Count + _config.RateIncrementer?.Invoke() ?? 1;

            // deep copy
            counter = new RateLimitCounter
            {
                Timestamp = entry.Value.Timestamp,
                Count = totalCount
            };
        }
    }

    // stores: id (string) - timestamp (datetime) - total_requests (long)
    await _counterStore.SetAsync(counterId, counter, rule.PeriodTimespan.Value, cancellationToken);
}

As you can see, this holds a lock per IP/client/etc. (depending on the plugin's configuration) and makes 2 Redis calls inside it. So if one implemented this atomic operation directly in Redis, replacing the lock plus the 2 calls with a single Redis call, I think the performance would improve a lot.

Any update on this?

Any plans to fix this?

Any plans to fix this?

The IDistributedCache service used to implement the distributed rate-limiting counter does not provide enough concurrency guarantees to resolve this race condition, and it likely never will. An atomic increment operation is needed, such as Redis' INCR command.

We resolved this issue by refactoring the IRateLimitCounterStore and backing it with a Redis cache, see the repo here. This also reduces per-request latency by eliminating the read/update/write operations that are the core of this issue (see here).

For each rate limit rule, time is divided into intervals the length of the rule's period. Requests are resolved to an interval, and that interval's counter is incremented. This is a slight change in semantics from the original implementation, but it works in our use case.

This approach requires a dependency on StackExchange.Redis to access the INCR command and key expiry, and the IConnectionMultiplexer service needs to be injected at startup.
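As a rough sketch of that interval scheme (assuming StackExchange.Redis; the names are illustrative, not the fork's actual code):

// map "now" to a fixed window of the rule's period, so every instance
// increments the same key for the same window
var interval = DateTime.UtcNow.Ticks / rule.PeriodTimespan.Value.Ticks;
var key = $"{counterId}:{interval}";

var count = await db.StringIncrementAsync(key); // atomic across instances

if (count == 1)
{
    // only the window's first request needs to set the expiry
    await db.KeyExpireAsync(key, rule.PeriodTimespan.Value);
}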

We resolved this issue by refactoring the IRateLimitCounterStore and backing it with a Redis cache
This approach requires a dependency on StackExchange.Redis to access the INCR command and key expiry, and the IConnectionMultiplexer service needs to be injected at startup.

As far as I can see, there is still a race condition there, because the call to the Redis INCR command and the call to set the expiration are different calls, and the counter may change in between. It looks like the only way to make it work correctly is to make it atomic using a Lua script, exactly as explained in the Redis documentation for the INCR command.

As far as I can see, there is still a race condition there, because the call to the Redis INCR command and the call to set the expiration are different calls, and the counter may change in between.

No race condition: it does not matter what the counter value is when the key expiry is set. The expiry only needs to be set once (but can be set any number of times), and the check for a counter value of 1 ensures this.

This is why the algorithm divides time into intervals of the rule's period. The expiry time will be the same for each interval, and any client may set it, as long as it is set at least once.

No race condition

Whatever is non-atomic (and your code is non-atomic) will produce race conditions. All of these cases can happen:

- your code gets frozen by the garbage collector after creating the counter and before setting the expiration, so the counter expires too late
- your code can crash after creating the counter and before setting the expiration (because of an OS shutdown or some other reason), so the counter never expires
- a network issue can prevent setting the expiration after creating the counter, so the counter never expires or expires too late

...and so on.

That's why the Redis documentation says it needs to be done with a Lua script, which is atomic; the Redis team wrote it for a reason.
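For reference, a sketch of that atomic variant via StackExchange.Redis' ScriptEvaluateAsync (the script follows the one in the Redis INCR documentation; the surrounding names are illustrative):

using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class AtomicRateLimit
{
    // Redis executes the whole script as one atomic operation, so there is
    // no window in which the counter can exist without an expiry.
    private const string IncrWithExpiry = @"
        local current = redis.call('INCR', KEYS[1])
        if current == 1 then
            redis.call('EXPIRE', KEYS[1], ARGV[1])
        end
        return current";

    public static async Task<long> IncrementAsync(IDatabase db, string key, TimeSpan period)
    {
        var result = await db.ScriptEvaluateAsync(
            IncrWithExpiry,
            new RedisKey[] { key },
            new RedisValue[] { (long)period.TotalSeconds });

        return (long)result;
    }
}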

@olegkv the right place to discuss the fork is over at its repo, so I've created an issue there for this.

I've only recently found this repo and was wondering if there are known instances of using this in a production environment and how successful it has been.

Presumably the side effect of this issue is an incorrect request count. I think we can get by with that within a certain range.

Are there any other consequences of this that we should know?

@nick-cromwell I maintain a fork whose master branch is used in production. It has a known issue in a distributed deployment where one node may set a Redis key and then leave the network (e.g. crash) before it sets the key to expire. In this case, access to the rate-limited resource is eventually blocked. This is an extremely rare case, but is serious enough that it has been addressed in the atomic-lua branch, which is being tested and has not yet been used in production.

Thanks for that, @simonhaines. I'll keep a look out for a pull request from the atomic-lua branch. I'm not familiar with Lua script, but based on your discussion here with @olegkv it seems that will resolve this issue once and for all.

@simonhaines Are you good to make a PR from your atomic-lua branch? I think that would close out this issue, correct?

@nick-cromwell I don't think so. The fork changes the IRateLimitCounterStore interface in an incompatible way. Merging would require a major version number increase. It also makes the StackExchange Redis client a hard dependency. I'm not sure which way @stefanprodan wants to take this library, but these are big changes.

Hey @stefanprodan, what do you think of the changes proposed by @simonhaines? I'm wanting to use this package but am holding off until the concurrency issues are sorted.

I was reading a recent article about how GitHub solved their rate-limiting issues (link), with the actual Lua script at the end.

Does anyone consider this a valid approach that would be worth looking into, and possibly creating a pull request for?

@validide there is a pull request #180 that incorporates distributed rate-limiting and it looks like a good way forward. However, the proposed changes require an increase to the major version, and it looks like the project structure is going to be refactored, with the distributed rate limiting split out into a separate project.

Wish it would get merged and the version bumped already though... it's been sitting a while.

This should have high priority, since this issue is a blocker for high-load projects...

I realize this is an open-source project and everyone is contributing on their own time, so is there any way I could help?

Thanks @validide. We're just waiting on #180 to be merged and the major version to be increased. Once #180 is merged I can open a PR for the StackExchangeRedisProcessingStrategy which resolves this.

We've pushed a new version, 4.0.0 (https://www.nuget.org/packages/AspNetCoreRateLimit/4.0.0), compatible with the Redis extension, which is available on NuGet here: https://www.nuget.org/packages/AspNetCoreRateLimit.Redis/

@cristipufu, sorry to revive this, but wanted to ask something. I've read through all of this thread, and the related PRs, and the general documentation, but something is still unclear to me: is the concurrency issue identified originally here still a problem if you're using SQL Server for the distributed store since it still uses the AsyncKeyLockProcessingStrategy? Or was this a problem unique to Redis?

If it is still an issue for SQL Server cases, is the expectation here basically that as you scale up, you eventually need to switch over from SQL Server to Redis (and the new implementation that uses the direct Redis calls)?

Apologies again for reviving this just to ask, but I'm trying to understand the risk factor here of using SQL Server so that I can make a recommendation to my team.

@bkromhout

is the concurrency issue identified originally here still a problem if you're using SQL Server for the distributed store since it still uses the AsyncKeyLockProcessingStrategy

Yes

If it is still an issue for SQL Server cases, is the expectation here basically that as you scale up, you eventually need to switch over from SQL Server to Redis (and the new implementation that uses the direct Redis calls)?

I think you can follow the Redis processing strategy example and implement something similar in SQL:

https://github.com/stefanprodan/AspNetCoreRateLimit/wiki/Using-Redis-as-a-distributed-counter-store#solution-2---in-memory-cache-fallback
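For what it's worth, "something similar in SQL" could be a single atomic upsert-and-increment statement, so no application-side lock is needed. A rough sketch (the table, column, and helper names are illustrative, and window/expiry handling is omitted):

using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class SqlRateLimit
{
    // One round trip that atomically upserts a counter row, increments it,
    // and returns the new count.
    private const string IncrementSql = @"
        MERGE RateLimitCounters WITH (HOLDLOCK) AS target
        USING (SELECT @id AS Id) AS source ON target.Id = source.Id
        WHEN MATCHED THEN
            UPDATE SET target.Count = target.Count + 1
        WHEN NOT MATCHED THEN
            INSERT (Id, Count, Timestamp) VALUES (@id, 1, SYSUTCDATETIME())
        OUTPUT inserted.Count;";

    public static async Task<long> IncrementAsync(string connectionString, string counterId)
    {
        await using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        await using var command = new SqlCommand(IncrementSql, connection);
        command.Parameters.AddWithValue("@id", counterId);

        return Convert.ToInt64(await command.ExecuteScalarAsync());
    }
}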

Can the docs be updated to provide an example of the in-memory fallback for the new version? Thanks.