grpc/grpc-node

Inferring `DEADLINE_EXCEEDED` message

pratik151192 opened this issue · 1 comments

Hello,

In a previous ticket we had requested for some additional details on stack traces during errors such as deadline exceeded. The additional information was added in this PR. Today I noticed one of the calls from our client spent 149 seconds in metadata filters which is way higher than our configured deadline values. My theory here is an overloaded client and event loop getting backed up, but I wanted to run by the trace anyway to see if I can infer anything else from it:

DEADLINE_EXCEEDED: Deadline exceeded after 149.406s,name resolution: 0.005s,
metadata filters: 149.399s,Waiting for LB pick

I went through the code where the Date for these are set and it seems like the DNS lookup completed (name resolution: 0.005s), the call started , got paused at the runtime for 150 seconds, woke back up, tried to reach the LB to fulfill the rpc but saw that the timer expired. Waiting for LB pick is kinda suspicious as that makes me think that the DNS resolution did not complete. I was also not 100% from this line whether configReceived means that the DNS lookup indeed completed? Curious to here your thoughts on this!

Yes, the config referred to in configReceived is the result of the DNS lookup, that's why it's considered the end of the name resolution time.

The deadline runs on a Node timer, not on checks at specific times, so if it doesn't fire at the set time, that means that the runtime wasn't even processing timers for that entire time. The most likely explanation is that either something else is completely blocking the event loop, or the process wasn't running at all.

When it says "Waiting for LB pick", that generally just means that the client hasn't established a connection to the server. In this case it could also mean that it didn't have a meaningful chance to try because the deadline timer fired immediately after it got to that step.