Enabling analytics leads to timeouts in response-streaming AWS Lambda

Question

Enabling analytics leads to timeouts in response-streaming AWS Lambda

unstubbable opened this issue 9 months ago · 8 comments

Follow-up from #89. Turns out, the analytics are the cause for the task timeouts, not the pending promise.

Additional info: The affected Lambda functions do use response streaming. ~~I haven't tested classic Lambdas with analytics yet.~~ (Update: Only response-streaming Lambdas are affected.) The response stream is ended properly, but the task does not complete and runs into a timeout.

A related PR is #91. This interfered with my observations because I wasn't even aware that I was switching from enabled to disabled analytics in one of my changes.

Answer 1 · 2024-03-18T08:29:11.000Z

It looks like you've found a temporary fix for the issue. If an immediate fix for analytics isn't necessary, I'll postpone addressing this issue for now since we're currently short on resources. Additionally, it would be helpful if you could share a sample code to make it easier to fix in the future.

Answer 2 · 2024-03-18T10:16:46.000Z

Here's a repro: https://github.com/unstubbable/upstash-ratelimit-analytics-lambda-timeout

The timeouts do not occur when analytics is disabled.
They also do not occur with analytics enabled, and using a non-streaming handler.
With analytics enabled, not only does the timeout occur, but the response times are also significantly higher (approx. 2–3s vs. 150ms)
Awaiting the pending promise does not help.

Answer 3 · 2024-03-19T13:08:58.000Z

Hey, @unstubbable, I haven't found the underlying issue, but I'm still trying to figure out.

Answer 4 · 2024-04-16T07:07:22.000Z

Hi @unstubbable,

I tried your lambda and saw that the requests timeout like you said. Long story short, I found a solution but didn't figure out why this occurs in the first place.

I will make sure to add an example for streaming lambda or update our docs.

Solution

I checked whether the analytics were submitted correctly to redis before the Lambda timeout. They are succesfully submitted.
We also receive a succesful response from the API, so the code in our handler executes properly

But we still get timeout. I tried to find if there were other people who experienced similar issues and found some posts [1, 2, 3, 4].

What I kept seeing was the callbackWaitsForEmptyEventLoop parameter in the Lambda nodejs context. I set the parameter to false in the handler you provided and the timeouts dissappeared.

import {Ratelimit} from '@upstash/ratelimit';
import {Redis} from '@upstash/redis';
import type {APIGatewayProxyEventV2} from 'aws-lambda';

export const handler = awslambda.streamifyResponse<APIGatewayProxyEventV2>(
  async (event, responseStream, context) => {
    console.log(JSON.stringify({event}));

    context.callbackWaitsForEmptyEventLoop = false
    // ...

I looked into why we need to set this parameter to false in our case but I couldn't figure that out yet. After this change, there is no more timeouts.

Checking Latency

After verify the solution, I checked the time it takes for the requests to complete. The request time reduced as we expect after the fix.

Pre Fix

--- Successful Requests Statistics ---
Mean Response Time:         2.65 seconds
Minimum Response Time:      2.469 seconds
1st Quantile Response Time: 2.571 seconds
Median Response Time:       2.631 seconds
3rd Quantile Response Time: 2.666 seconds
Maximum Response Time:      3.271 seconds

--- Rate-Limited Requests Statistics ---
Mean Response Time:         2.63 seconds
Minimum Response Time:      2.508 seconds
1st Quantile Response Time: 2.563 seconds
Median Response Time:       2.638 seconds
3rd Quantile Response Time: 2.694 seconds
Maximum Response Time:      2.766 seconds

--- Summary ---
Total Successful Requests: 41
Total Rate-Limited Requests: 19

Post Fix

--- Successful Requests Statistics ---
Mean Response Time:         0.456 seconds
Minimum Response Time:      0.306 seconds
1st Quantile Response Time: 0.309 seconds
Median Response Time:       0.316 seconds
3rd Quantile Response Time: 0.425 seconds
Maximum Response Time:      1.353 seconds

--- Rate-Limited Requests Statistics ---
Mean Response Time:         0.31 seconds
Minimum Response Time:      0.289 seconds
1st Quantile Response Time: 0.299 seconds
Median Response Time:       0.301 seconds
3rd Quantile Response Time: 0.308 seconds
Maximum Response Time:      0.604 seconds

--- Summary ---
Total Successful Requests: 10
Total Rate-Limited Requests: 50

Answer 5 · 2024-04-16T07:17:10.000Z

Btw, I don't know for sure, there might be a floating promise in our core analytics packages. Thats why event loop waits until the end to execute the floating promises and sometimes not even executing them. That might be worth checking.

Answer 6 · 2024-04-17T15:12:25.000Z

Hi again @unstubbable,

We identified the reason why we had to set the callbackWaitsForEmptyEventLoop parameter to false.

In the core-analytics library, there is a cache which calls setInterval but clearInterval is never called. We are making changes to disable the cache in rate limit library and only enable it in the analytics dashboard in Upstash console which is the original purpose of the cache.

After this change, lambda timeouts stopped without having to set callbackWaitsForEmptyEventLoop to false.

Thanks for bringing this issue to our attention!

Answer 7 · 2024-04-24T11:39:19.000Z

Closing this issue since the changes were merged and version was bumped. Thanks again @unstubbable.