Review Poller Concurrency Design
ashovlin opened this issue · 1 comment
It might be worth comparing notes on this stuff: we had a similar poller implementation in JustSaying 5 and moved away from it. I did a write-up about the problems it had. Some of it was due to the implementation, but in general an eager check of "how many free workers do we have" the moment one has finished leads to choppy message consumption when there's back-pressure to work through, as well as all of the complications you are facing around keeping the count accurate (which is essential).
The write-up is here (that one's more about the shared throttling mechanism we had) and the discussion is here (this describes the choppiness issue, which is one of the factors that led to a Channels-based multiplexer).
We switched out to a mechanism that pre-fetches 1 batch lazily; with the message visibility renewal process you have, it might be a better approach. I see you have the possibility to allow replacing the poller; it would be handy for us if the abstraction didn't necessitate the presence of an ActiveMessageCount.
What we have by default now essentially means as soon as 1 message is being processed from a batch, the next batch is fetched. The downside is naturally that you might take a message earlier than you are ready to process it, but again the visibility renewal takes care of that.
I'd also suggest reviewing the default visibility timeout and renewal values. 20 seconds and 18 for renewal might be cutting it fine; 30 and 25 give a bit more room for error.
Originally posted by @slang25 in #35 (comment)
I'd like to merge the PR that comment was left on sooner rather than later, to fix a bug in the current implementation and unblock other areas of work, so I'm creating this issue to track and discuss revisiting the poller design. DOTNET-6916 tracks this in the internal issue tracker.
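For reference, the lazy single-batch prefetch slang25 describes can be sketched with a bounded channel and the SQS SDK. This is a minimal illustrative sketch, not code from JustSaying or this framework; the batch size, channel capacity, and visibility timeout are assumed values.

```csharp
using System.Threading.Channels;
using Amazon.SQS;
using Amazon.SQS.Model;

// Bounded buffer sized to one batch: writes block once it is full.
var buffer = Channel.CreateBounded<Message>(10);
IAmazonSQS sqs = new AmazonSQSClient();

async Task PollAsync(string queueUrl, CancellationToken ct)
{
    // WaitToWriteAsync completes once at least one buffer slot frees up,
    // i.e. as soon as a worker takes one message from the current batch,
    // the next batch is fetched.
    while (await buffer.Writer.WaitToWriteAsync(ct))
    {
        var response = await sqs.ReceiveMessageAsync(new ReceiveMessageRequest
        {
            QueueUrl = queueUrl,
            MaxNumberOfMessages = 10,
            WaitTimeSeconds = 20,  // long polling
            VisibilityTimeout = 30 // the roomier value suggested above
        }, ct);

        foreach (var message in response.Messages)
            await buffer.Writer.WriteAsync(message, ct);
    }
}
```

Because the buffer can hold a full batch, messages may sit waiting for a worker, which is exactly why the visibility renewal process matters in this design.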
We had an internal doc readout that explains how JustSaying manages concurrency during message handling and why they moved away from a concurrency strategy similar to the one currently used by the AWS Message Processing Framework for .NET.
JustSaying has a concept called SubscriptionGroups: a SubscriptionGroup is a collection of SQS queues that share the same configuration. By default, each queue gets its own SubscriptionGroup. When a concurrency limit is set on a subscription group, the limit is shared across all SQS queues belonging to that group. This means that if the concurrency limit is set to 10, the combined maximum number of concurrently handled messages across all queues in the group is 10.
SubscriptionGroups are built on data structures from System.Threading.Channels.
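As a rough illustration of the mechanism (not JustSaying's actual implementation), several pollers can write into one shared channel that a fixed pool of workers drains, which is what makes the concurrency limit shared across every queue in the group:

```csharp
using System.Linq;
using System.Threading.Channels;

// One shared channel fed by every poller in the subscription group.
var shared = Channel.CreateBounded<string>(capacity: 100);

// With 10 workers draining it, at most 10 messages across *all* queues
// in the group are handled concurrently.
const int ConcurrencyLimit = 10;

Task HandleAsync(string message) => Task.CompletedTask; // placeholder handler

var workers = Enumerable.Range(0, ConcurrencyLimit)
    .Select(_ => Task.Run(async () =>
    {
        await foreach (var message in shared.Reader.ReadAllAsync())
            await HandleAsync(message);
    }))
    .ToArray();

await Task.WhenAll(workers);
```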
Benefits of SubscriptionGroups
A subscription group offers throttled access to a shared resource that is used by multiple handlers concurrently. Imagine a scenario in which several queues have handlers that all hit the same database. To be resilient under load, it is important not to overload the database by letting too many requests through. A subscription group enforces a concurrency limit that is shared across all of those handlers.
Drawbacks of Using Subscription Groups
Subscription groups induce a constant backpressure of messages, governed by the WithPrefetch extension method, which controls the number of messages read from SQS on each call. Each SQS queue within the group prefetches 1 batch of messages during the first iteration. Subsequently, as soon as 1 message from a batch begins processing, the next batch is fetched pre-emptively.
The downside to this approach is that each polling request receives more messages than can be consumed immediately. These messages are buffered until a worker is available, and their visibility is continually monitored and extended in the background. This can have a business impact for customers, since requests to extend message visibility are chargeable. This drawback was also mentioned in one of the PR comments left by a maintainer of JustSaying. Concurrency management in subscription groups is decoupled from the message polling logic: concurrency limits only restrict the number of handlers being invoked across all SQS queues; they do not influence the number of messages fetched from SQS during each service call.
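To make the cost concrete: each buffered message must have its visibility extended on a timer until a worker picks it up, and every extension is a billable SQS request. Below is a hedged sketch of such a heartbeat loop; the interval and timeout values are illustrative, and real code would chunk entries to SQS's batch limit of 10.

```csharp
using System.Linq;
using Amazon.SQS;
using Amazon.SQS.Model;

async Task HeartbeatAsync(IAmazonSQS sqs, string queueUrl,
    IReadOnlyList<Message> buffered, CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        // Renew before the (assumed) 30-second visibility timeout lapses.
        await Task.Delay(TimeSpan.FromSeconds(25), ct);
        if (buffered.Count == 0) continue;

        // Each iteration is a chargeable ChangeMessageVisibilityBatch call,
        // incurred for messages that are merely waiting for a free worker.
        await sqs.ChangeMessageVisibilityBatchAsync(new ChangeMessageVisibilityBatchRequest
        {
            QueueUrl = queueUrl,
            Entries = buffered.Select((m, i) => new ChangeMessageVisibilityBatchRequestEntry
            {
                Id = i.ToString(),
                ReceiptHandle = m.ReceiptHandle,
                VisibilityTimeout = 30
            }).ToList()
        }, ct);
    }
}
```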
Additionally, messages polled from multiple SQS queues are buffered asynchronously and interleaved. Since there are no guarantees on message ordering, JustSaying does not support FIFO handling of messages and has an open GitHub issue about it.
Why did JustSaying opt to use a Channels based approach to concurrency management?
This GitHub issue outlines the concurrency-related issues in JustSaying prior to V7. Below is a summary:
They had an interface called `IMessageProcessingStrategy` with the following members:

```csharp
public interface IMessageProcessingStrategy
{
    int MaxWorkers { get; }
    int AvailableWorkers { get; }
    void StartWorker(Func<Task> action);
    Task WaitForAvailableWorkers();
}
```
Imagine a scenario where a single instance of `IMessageProcessingStrategy` was shared across 2 SQS pollers.

- T0: `AvailableWorkers` is 0. Both pollers are blocked on `WaitForAvailableWorkers()`.
- T1: `AvailableWorkers` is set to 1, and both pollers attempt to read this value.
- T2: Poller 1 reads the value first, requests 1 message from SQS, and dispatches it for processing, decrementing `AvailableWorkers` to 0.
- T3: Poller 2 still has the stale value of `AvailableWorkers` in memory, so it requests 1 more message and eventually dispatches it for processing, violating the concurrency limit.
They did not have a thread-safe way to maintain the `AvailableWorkers` count across multiple SQS pollers, and therefore moved away from this design.
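The failure at T3 is a classic check-then-act race: the count is read in one step and acted on in another. One thread-safe alternative, shown here only as an illustrative sketch (JustSaying instead moved to the Channels design above), is to acquire a worker slot atomically, for example with a `SemaphoreSlim`:

```csharp
// Shared across all pollers; each slot represents one free worker.
var workers = new SemaphoreSlim(initialCount: 10, maxCount: 10);

// Hypothetical stand-ins for fetching and handling a message.
Task<string> FetchOneMessageAsync(CancellationToken ct) => Task.FromResult("msg");
Task HandleAsync(string message) => Task.Delay(100);

async Task PollerLoopAsync(CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        // Acquire a slot *before* fetching: two pollers can never act on
        // the same free slot, unlike the stale AvailableWorkers read at T3.
        await workers.WaitAsync(ct);

        var message = await FetchOneMessageAsync(ct);
        _ = Task.Run(async () =>
        {
            try { await HandleAsync(message); }
            finally { workers.Release(); } // return the slot when done
        }, ct);
    }
}
```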
Conclusion
There are no user-reported issues with the concurrency strategy currently used by the AWS Message Processing Framework for .NET. It is stable, with extensive test coverage around it. We do not currently plan to support a concurrency limit shared across multiple SQS queues, and have therefore decided to keep our current concurrency management strategy.