P100 Lambda duration spikes
frankenbubble opened this issue · 8 comments
Hi,
In the past 2 weeks we have been getting a lot of P100 spikes in Lambda duration and can't figure out the cause. We are on 6.0.8 and are considering an upgrade to address the 3 vulnerabilities fixed in recent versions.
However, one thing we noticed using X-Ray on these spikes is that the response time is fast while the duration is slow. As best I can tell, this implies the code is doing some work after it has finished responding, and I'm wondering if anyone has ideas on how to troubleshoot further.
Snippet of a screenshot from X-Ray:
https://docs.aws.amazon.com/xray/latest/api/API_TraceSummary.html states: "If the service performs work asynchronously, the response time measures the time before the response is sent to the user, while the duration measures the amount of time before the last traced activity completes."
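For what it's worth, one way to dig into this is to pull trace summaries and flag the traces where duration greatly exceeds response time, so the slow ones can be inspected individually. A minimal sketch, assuming the AWS SDK for JavaScript v3 (`@aws-sdk/client-xray`); the thresholds, region, and function name are illustrative:

```typescript
// Sketch: list recent traces whose Duration greatly exceeds ResponseTime,
// i.e. the "fast response, slow duration" traces worth inspecting one by one.
import { XRayClient, GetTraceSummariesCommand } from "@aws-sdk/client-xray";

const client = new XRayClient({ region: "eu-west-1" });

async function findSlowDurationTraces(): Promise<void> {
  const end = new Date();
  const start = new Date(end.getTime() - 60 * 60 * 1000); // last hour

  let nextToken: string | undefined;
  do {
    const page = await client.send(
      new GetTraceSummariesCommand({ StartTime: start, EndTime: end, NextToken: nextToken })
    );
    for (const summary of page.TraceSummaries ?? []) {
      const duration = summary.Duration ?? 0;         // seconds until last traced activity completes
      const responseTime = summary.ResponseTime ?? 0; // seconds until the response is sent
      if (responseTime < 1 && duration > 10) {
        console.log(summary.Id, { responseTime, duration });
      }
    }
    nextToken = page.NextToken;
  } while (nextToken);
}

findSlowDurationTraces().catch(console.error);
```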
Hi @frankenbubble,
That is very strange; I've attempted to reproduce it on my end but haven't been able to thus far. Can you provide some more information about your use case? How often are you sending requests to the Lambda? Does the CloudWatch duration match what X-Ray has recorded? This is for the BackendImageHandlerFunction, correct? What percentage of requests are seeing this sort of duration?
Simon
Hi Simon,
Yes, CloudWatch aligns with the X-Ray response time metric. We have 4 independent image handlers, all exhibiting similar behaviour, and all with the same pattern where the behaviour changes on 09-02 and again on 09-06. They have nothing in common, which makes me think it's something external, but we can't figure it out yet. (4 unique sites, 4 domains, different code: PHP, Java, etc.)
Our use case is serving JPEGs/WebPs resized from JPEG sources, with signed URLs. Our cache hit rates and request rates all look normal.
Hi @frankenbubble,
I'm seeing some similar behaviour for other deployments in the eu-west-1 region, though the durations are not nearly as exaggerated (duration about 100x the response time rather than 4000x). Are all 4 of your deployments in eu-west-1? I'll continue to do some investigation and reach out here once I've made some progress.
Simon
Yeah, all eu-west-1. A high percentage of our traffic would also be hitting CloudFront POPs in the same region.
I know there are anonymous stats sent, but a brief look at the code made me think that happens during deployment, not during operation. If they were sent during operation and the collection pipeline were having problems, it could explain why duration is longer than response time.
@simonkrol This is an X-Ray trace for a 13-second response, and the corresponding logs.
I don't see anything helpful here. I think we would have to add additional logging to see the timings of the steps in the function, i.e. time to get the object from S3, time to resize, etc. (something like the sketch below). We don't have capacity to add that extra logging in the coming weeks, and as it's a P100 issue, it isn't impacting customers enough to prioritize it over other work right now.
Any other thoughts? Otherwise we might need to stick a pin in it.
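For reference, a rough sketch of the per-step timing described above, assuming a Node.js/TypeScript handler; apart from `imageRequest.setup(event)`, which is mentioned later in this thread, the wrapped step names are placeholders, not the solution's actual call graph:

```typescript
// Wrap each stage of the handler so one structured log line per step lands in
// CloudWatch Logs, making it easy to see which stage (setup, S3 fetch, resize)
// accounts for the slow P100 requests.
async function timed<T>(step: string, fn: () => Promise<T>): Promise<T> {
  const startedAt = Date.now();
  try {
    return await fn();
  } finally {
    console.log(JSON.stringify({ step, millis: Date.now() - startedAt }));
  }
}

// Hypothetical usage inside the handler (fetchOriginalImage and resizeImage are
// placeholders for the S3 fetch and resize steps):
// const request  = await timed("imageRequest.setup", () => imageRequest.setup(event));
// const original = await timed("s3.getObject", () => fetchOriginalImage(request));
// const output   = await timed("resize", () => resizeImage(original, request));
```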
Hmm, those aren't the same sort of issues you were initially hitting, right? Those seem to actually have extended response times rather than just extended durations. Based on the timing of the logs that do exist, the imageRequest.setup(event) step is what's taking much longer than normal; I don't see anything in there that jumps out as a potential cause. Have you noticed any similarities in the values of the slow requests?
Closing due to inactivity. Please feel free to reopen if it's still an issue.