aws-powertools/powertools-lambda-typescript

Feature request: In Batch Processing, allow sending requests directly to DLQ

Opened this issue · 9 comments

Use case

When handling events from SQS and failing a message, not all errors are the same. E.g. if the event fails validation, then no amount of retrying will get it to pass, but throwing an error for it will lead to it being retried like any other failure.
Instead, the event should end up directly on the SQS queue's DLQ so that I can inspect and potentially redrive it, without my logs getting spammed with errors from the retries.

Solution/User Experience

 import {
   SQSBatchProcessor,
   EventType,
   processPartialResponse,
   NonRetriableError,
 } from '@aws-lambda-powertools/batch';
 import type { SQSHandler, SQSRecord } from 'aws-lambda';

 const processor = new SQSBatchProcessor({ dlqUrl: process.env.DLQ_URL }); 

 const recordHandler = async (record: SQSRecord): Promise<void> => { 
   throw new NonRetriableError();
 };

 export const handler: SQSHandler = async (event, context) =>
   processPartialResponse(event, recordHandler, processor, { 
     context,
   });

The processor needs to be configured with the DLQ URL so it can send messages there. This also avoids any breaking change (even though just defining our own NonRetriableError should already be enough for that). It's a bit awkward to add options to BatchProcessor that are specific to the SQS event type, though theoretically this might make sense for other event sources as well.

Alternative solutions

Currently, I'm just accepting that my logs will get spammed in case of one of these errors.


Future readers

Please react with 👍 and your use case to help us understand customer demand.

Hi @dbartholomae, thank you for opening this feature request.

The Batch Processor doesn't interact with SQS (or other) APIs directly but instead uses partial batch responses to tell the event source which items failed by having your function handler return a JSON object with the IDs of the failed items.
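
For reference, this is the shape of that response for SQS (using the types from aws-lambda; the message ID below is just an example value):

import type { SQSBatchResponse } from 'aws-lambda';

// Only the items listed here are retried by the event source;
// everything else in the batch is treated as successfully processed.
const response: SQSBatchResponse = {
  batchItemFailures: [{ itemIdentifier: '059f36b4-87a3-44ab-83d2-661975830a7d' }],
};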

The feature you're describing would require us to take a dependency on the AWS SDK SQS client, which would have an impact on both cost and performance.

This might be a good tradeoff assuming the feature is useful to customers, but before implementing this we'll definitely want to hear from more customers to see if the use case is general enough.

In the meantime, you could implement this logic in your code with something like this:

import {
  BatchProcessor,
  EventType,
  processPartialResponse,
} from '@aws-lambda-powertools/batch';
import type { SQSHandler, SQSRecord } from 'aws-lambda';

const processor = new BatchProcessor(EventType.SQS);
const nonRetriableMessages: SQSRecord[] = [];

const recordHandler = async (record: SQSRecord): Promise<void> => {
  if (nonRetriableCondition) { // placeholder for your own check for permanent errors
    nonRetriableMessages.push(record);
    // returning here means the item gets marked as successful,
    // which causes it to be removed from the event source
    return;
  }
};

export const handler: SQSHandler = async (event, context) => {
  // reset the collection between warm invocations
  nonRetriableMessages.length = 0;

  const batchFailures = await processPartialResponse(event, recordHandler, processor, {
    context,
  });

  // send your messages to DLQ here (see the sketch below)

  return batchFailures;
};
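
For the "send your messages to DLQ here" step, a minimal sketch could look like the following, assuming the AWS SDK v3 SQS client, a DLQ_URL environment variable, and a hypothetical sendToDlq helper that the handler calls with nonRetriableMessages (chunking only by entry count; see the discussion on SendMessageBatch limits further down):

import { SQSClient, SendMessageBatchCommand } from '@aws-sdk/client-sqs';
import type { SQSRecord } from 'aws-lambda';

const sqs = new SQSClient({});

// Forwards the collected records to the DLQ.
// SendMessageBatch accepts at most 10 entries per request,
// so the records are sent in chunks of 10.
const sendToDlq = async (records: SQSRecord[]): Promise<void> => {
  for (let i = 0; i < records.length; i += 10) {
    await sqs.send(
      new SendMessageBatchCommand({
        QueueUrl: process.env.DLQ_URL,
        Entries: records.slice(i, i + 10).map((record) => ({
          Id: record.messageId,
          MessageBody: record.body,
        })),
      })
    );
  }
};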

Finally, if the issue is just having noisy logs, might I suggest looking into log buffering?


For others bumping into this issue: if this feature is interesting to you, please add a 👍 to the original post at the top or leave a comment to describe your use case. Our roadmap is influenced primarily by demand, so having clear signals from customers helps us decide what to work on.

Thanks! Yes, an implementation in user land is definitely possible, though implementing it as a custom batch processor would be cleaner.

Regarding the dependency: since AWS SDK v3 this should be tree-shakeable, but another option would be to inject the client into the processor, so that the processor and this library would only need to depend on the SQS types at compile time.
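
As a rough sketch of what that injection could look like (everything here, including the option names, is hypothetical and not part of the current API):

import type { SQSClient } from '@aws-sdk/client-sqs';

// Hypothetical options: the caller constructs and passes in the client,
// so this library only needs the SQS *types* at compile time and never
// imports or bundles the runtime SDK itself.
type SQSBatchProcessorOptions = {
  dlqUrl: string;
  sqsClient: SQSClient;
};

// Usage (hypothetical):
// const processor = new SQSBatchProcessor({
//   dlqUrl: process.env.DLQ_URL,
//   sqsClient: new SQSClient({}),
// });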

And log buffering doesn't really help, as there will be at least one error per retry per event, and these might not even occur in the same Lambda invocation.

But as with my last issue (which I think is currently the 4th-highest voted in this repo), I do agree that it can make sense to wait for more user input here :)

@leandrodamascena thoughts on this feature request?

Hi @dbartholomae, thank you so much for opening this issue. I love discussions about the BatchProcessor utility. EDA is my passion ❤️

Use case

When handling events from SQS and failing a message, not all errors are the same. E.g. if the event fails validation, then no amount of retrying will get it to pass, but throwing an error for it will lead to it being retried like any other failure.
Instead, the event should end up directly on the SQS queue's DLQ so that I can inspect and potentially redrive it, without my logs getting spammed with errors from the retries.

 import {
   SQSBatchProcessor,
   EventType,
   processPartialResponse,
   NonRetriableError,
 } from '@aws-lambda-powertools/batch';
 import type { SQSHandler, SQSRecord } from 'aws-lambda';

 const processor = new SQSBatchProcessor({ dlqUrl: process.env.DLQ_URL }); 

 const recordHandler = async (record: SQSRecord): Promise<void> => { 
   throw new NonRetriableError();
 };

 export const handler: SQSHandler = async (event, context) =>
   processPartialResponse(event, recordHandler, processor, { 
     context,
   });

This is a really good point, and it reminds me of the poison pill messaging issues I've often encountered when working with event-driven architecture. But I'm wondering how this differs from the DLQ mechanism that SQS and Lambda come with out of the box. As @dreamorosi mentioned, implementing it in Powertools would cost SDK imports and API calls, while Lambda/SQS offers this mechanism for free after a configurable number of retries (it can be 1).

Also, you mentioned that not all errors are the same, but in your example you simply provide an SQS queue to send the failed message to, which means that regardless of the error (which can be transient), the failed message will end up in the DLQ. Why not use the built-in mechanism in Lambda/SQS and set the retry count to 1?

The processor needs to be configured with the DLQ URL so it can send messages there. This also avoids any breaking change (even though just defining our own NonRetriableError should already be enough for that). It's a bit awkward to add options to BatchProcessor that are specific to the SQS event type, though theoretically this might make sense for other event sources as well.

Yes, this is a very specific use case and I think we need to guide customers to use the built-in mechanism provided by SQS and Lambda.
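
For reference, the built-in mechanism is the queue's redrive policy; a minimal CDK sketch (queue names are placeholders):

import { Stack } from 'aws-cdk-lib';
import { Queue } from 'aws-cdk-lib/aws-sqs';
import type { Construct } from 'constructs';

// With maxReceiveCount set to 1, any message that is received once and not
// successfully processed is moved to the DLQ by SQS itself instead of being
// redelivered, with no extra code in the handler.
class QueueStack extends Stack {
  public constructor(scope: Construct, id: string) {
    super(scope, id);
    const dlq = new Queue(this, 'Dlq');
    new Queue(this, 'MainQueue', {
      deadLetterQueue: { queue: dlq, maxReceiveCount: 1 },
    });
  }
}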

I vote to keep it open for a while, and if we get enough reactions, we can discuss making it specific to particular exceptions rather than generic ones.

Thanks

If I understand the feature request correctly, customers need to throw/raise a specific kind of exception that is set by us (notice the NonRetriableError import), so they can tell the BatchProcessor that a specific item should not be retried but sent to the DLQ directly.

This also assumes that customers will implement logic in the record handler to handle their own errors and decide to either let it raise/throw (which will cause the item to be marked as failed) or handle it and throw a NonRetriableError.

From the BatchProcessor side, whenever we see a NonRetriableError thrown by a record handler, we don't mark the item as failed (but not successful either), so the ESM removes it from the source queue.

Then at some point before returning the response, we need to make the call to send the messages, which will also need to take chunk sizes into account.

Also, what happens when there's an error of any kind in that call to publish the messages to the DLQ? At that point we can only retry but there's a chance of data loss if the issue is not retriable, which is kinda bad.

Oops, I missed the NonRetriableError part, which is good because we can identify the error. But I also see your point @dreamorosi about not being able to put a message in the SQS queue. Also, the DLQ mechanism is a nice free feature for customers working with Lambda and EDA, so I think customers will get much more benefit from taking advantage of it than from making API calls in Powertools.

But again, let's keep this open to see what customers think.

Why not use the built-in mechanism in Lambda/SQS and set the retry count to 1?

I think this was already answered, but to reiterate: almost all event handlers have two very different error cases. One is transient errors, e.g. when the handler calls an external service and there's a network hiccup; these can be resolved simply by retrying. The other is permanent errors, e.g. validation errors on the message or misconfiguration inside the handler. Retrying these leads to unnecessary cost and noise, and they can easily explode, since e.g. one error from misconfiguration usually means that all messages will fail.

Currently, SQS offers a built-in way to deal correctly with either case, but not with both at the same time, which is unfortunate, as handlers typically have both.
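
To make the two cases concrete, a record handler typically ends up looking something like this (a sketch; NonRetriableError is a plain user-defined class here, and callDownstreamService is a placeholder for whatever external call the handler makes):

import type { SQSRecord } from 'aws-lambda';

// Placeholder for whatever downstream call the handler makes.
declare function callDownstreamService(payload: unknown): Promise<void>;

// User-defined marker for permanent failures; retrying these is pointless.
class NonRetriableError extends Error {}

const recordHandler = async (record: SQSRecord): Promise<void> => {
  let payload: unknown;
  try {
    payload = JSON.parse(record.body);
  } catch {
    // Permanent: this message will never parse, no matter how often it's retried.
    throw new NonRetriableError(`Invalid payload for message ${record.messageId}`);
  }
  // Transient: a network hiccup here is worth retrying, so let the error propagate as-is.
  await callDownstreamService(payload);
};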

I've also found this related discussion on Java PowerTools:

aws-powertools/powertools-lambda#21

Also, what happens when there's an error of any kind in that call to publish the messages to the DLQ? At that point we can only retry but there's a chance of data loss if the issue is not retriable, which is kinda bad.

That should not be a problem even with a naive implementation: the error from the SQS call will bubble up, causing the Lambda to fail and putting the whole batch up for retry again.

If we put the SQS call in a try-catch, we can just fail this one message and have it retried until it ends up in the DLQ.

That should not be a problem even with a naive implementation: the error from the SQS call will bubble up, causing the Lambda to fail and putting the whole batch up for retry again.

Yes, but there are important caveats here, which is why I was suggesting to leave the actual partial failure response intact.

While this is technically correct, throwing will degrade your service.

Especially so, since we're talking about non-retriable errors which might cause unhandled exceptions multiple times.

When you set a Lambda function to consume a queue, there's a third component called the event source mapping (ESM), which basically acts as a stateful poller between the two.

To avoid overwhelming the function, the ESM intelligently forwards messages at a rate compatible with the function's current scaling, and uses failure metrics to slow down.

Whenever your function has unhandled exceptions, the ESM interprets this as a signal that the consumer has become degraded and slows down the flow of messages sent to it.

In the example we're describing, the entire batch will be sent back, potentially multiple times, causing exactly this to happen.

Even if this is an acceptable tradeoff, the next challenge will be finding the right batch size while sending to the DLQ.

When using the SendMessageBatch operation, you can send up to 10 messages per request, with a 1 MiB limit that applies both to each individual message and to the maximum total payload size (i.e. the sum of the individual lengths of all the batched messages).

Since we don't know what customer messages look like, we'll have to either spam retries until it works (adding latency) or implement logic that takes these limitations into account and creates chunks accordingly.
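
A sketch of such chunking logic, taking both limits into account (the byte accounting is simplified and ignores message attributes and per-entry overhead; a single entry larger than the limit would still need separate handling):

// Splits DLQ entries into SendMessageBatch-sized requests:
// at most 10 entries per request and at most maxBytes of total body size.
type DlqEntry = { Id: string; MessageBody: string };

const chunkForSendMessageBatch = (
  entries: DlqEntry[],
  maxBytes = 1024 * 1024
): DlqEntry[][] => {
  const chunks: DlqEntry[][] = [];
  let current: DlqEntry[] = [];
  let currentBytes = 0;
  for (const entry of entries) {
    const size = Buffer.byteLength(entry.MessageBody, 'utf8');
    const full =
      current.length === 10 || (current.length > 0 && currentBytes + size > maxBytes);
    if (full) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(entry);
    currentBytes += size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
};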

Sure, but you can solve this with a less naive implementation where you just mark the message as failed (so it goes back into the retry flow) if the SQS call fails. In any case, it would be less problematic than repeatedly retrying a message that is set to fail.