Started receiving "unable to verify the first certificate" when interacting with the sdk

Question

nodkrot opened this issue 2 months ago · 43 comments

Versions

SDK package version: botbuilder@4.16.0
Node.js version: v16.20.2
OS: Mac

Describe the bug

On July 3rd we started receiving "unable to verify the first certificate" error when starting and using botbuilder sdk APIs. It appears to happen to some but not all customers.

To Reproduce

Instantiate the botbuilder SDK and call any operation (for example, send or update a message).

Expected behavior

No error

"exception":{"message":"unable to verify the first certificate","stack":"Error: unable to verify the first certificate\n    at new RestError (/var/www/app/node_modules/@azure/ms-rest-js/lib/restError.ts:18:5)\n    at AxiosHttpClient.<anonymous> (/var/www/app/node_modules/@azure/ms-rest-js/lib/axiosHttpClient.ts:194:15)\n    at step (/var/www/app/node_modules/@azure/ms-rest-js/node_modules/tslib/tslib.js:141:27)\n    at Object.throw (/var/www/app/node_modules/@azure/ms-rest-js/node_modules/tslib/tslib.js:122:57)\n    at rejected (/var/www/app/node_modules/@azure/ms-rest-js/node_modules/tslib/tslib.js:113:69)\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)"},"level":"error","message":"Failure updating message: unable to verify the first certificate"

Answer 1 · 2024-07-05T18:49:55.000Z

I'm getting the same error right now.

Answer 2 · 2024-07-05T20:17:15.000Z

I'm getting the same error right now.

Answer 3 · 2024-07-07T20:54:35.000Z

Getting the same error intermittently too!

Answer 4 · 2024-07-08T02:16:25.000Z

We are receiving the same error as well.

Answer 5 · 2024-07-08T05:41:07.000Z

We are seeing the same on our two apps as well.

Answer 6 · 2024-07-08T13:20:18.000Z

The OP indicates 4.16, which is quite old. What about the others in this thread?

Answer 7 · 2024-07-08T13:29:59.000Z

I'm on 4.22.2

Answer 8 · 2024-07-08T14:13:38.000Z

Same issue for us.

First occurrence: 2024-07-03 03:37:56 pm PDT.

Since then 290 further occurrences.

For the time being we're catching this error and simply retrying each request up to 3x and virtually all succeed. However we'd prefer to not have to do this.
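
A minimal sketch of the catch-and-retry workaround described above. The helper name `withRetry` and its options are hypothetical and not part of the botbuilder SDK; the default of three attempts is one reading of "up to 3x", and the delay value is arbitrary.

```javascript
// Hypothetical helper: run an async operation, retrying when the caller's
// predicate says the error is worth retrying.
async function withRetry(operation, options = {}) {
  const { maxAttempts = 3, delayMs = 200, shouldRetry = () => true } = options;
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      // Stop on the final attempt, or when the error is not considered transient.
      if (attempt === maxAttempts || !shouldRetry(err)) break;
      // Short pause before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Possible usage, assuming `context` is a TurnContext:
// await withRetry(() => context.sendActivity('hello'), {
//   shouldRetry: (e) => String(e.message).includes('unable to verify the first certificate'),
// });
```

Passing `shouldRetry` keeps unrelated failures (for example validation errors) from being retried blindly.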

Answer 9 · 2024-07-08T14:35:59.000Z

We are seeing the issue on 4.22.3 as well.

Answer 10 · 2024-07-08T14:50:19.000Z

Since this is happening on older versions, it's probably not related to a recent change in the SDK. I'll check for issues with the other end. I'm assuming it's failing during the request, especially since retries make it work.

I'm wondering what the HTTP response status code is. Is it being throttled? Though that is a specific status code and handled automatically. We can help mitigate by increasing our retries.

Answer 11 · 2024-07-08T14:56:58.000Z

> I'm wondering what the HTTP response status code is. Is it being throttled? Though that is a specific status code and handled automatically. We can help mitigate by increasing our retries.

We receive a 500, if I'm not mistaken. Since it's intermittent, my guess is that there is a load balancer pointing to a server with an expired certificate.
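
One way to test the load-balancer hypothesis would be to open several TLS connections and record which ones fail verification. The following is a rough sketch using Node's built-in `tls` module; the function names and the attempt count are illustrative, and the hostname in the example comment is the one from the endpoint URLs reported later in this thread.

```javascript
const tls = require('tls');

// Open one TLS connection without rejecting on verification failure,
// then report whether Node could verify the certificate chain.
function probeOnce(host, port = 443) {
  return new Promise((resolve) => {
    const socket = tls.connect(
      { host, port, servername: host, rejectUnauthorized: false },
      () => {
        const cert = socket.getPeerCertificate();
        resolve({
          authorized: socket.authorized,
          error: socket.authorizationError ? String(socket.authorizationError) : null,
          remoteAddress: socket.remoteAddress,
          validTo: cert.valid_to,
        });
        socket.end();
      }
    );
    // Connection-level failures (for example ECONNRESET) are recorded too.
    socket.on('error', (err) => resolve({ authorized: false, error: err.code || err.message }));
  });
}

// Count failed probes, grouped by failure reason.
function summarize(results) {
  const summary = { total: results.length, failed: 0, reasons: {} };
  for (const result of results) {
    if (result.authorized) continue;
    summary.failed++;
    const reason = result.error || 'unknown';
    summary.reasons[reason] = (summary.reasons[reason] || 0) + 1;
  }
  return summary;
}

// Probe sequentially so each attempt opens a fresh connection.
async function probeMany(host, attempts = 20) {
  const results = [];
  for (let i = 0; i < attempts; i++) results.push(await probeOnce(host));
  return summarize(results);
}

// Example (requires network access):
// probeMany('smba.trafficmanager.net').then(console.log);
```

In Node.js an expired certificate is normally reported as `CERT_HAS_EXPIRED`, while `UNABLE_TO_VERIFY_LEAF_SIGNATURE` usually indicates that the server did not send a complete chain (for example a missing intermediate certificate), so the per-reason counts may help tell the two cases apart. DNS caching can pin all probes to the same node, so a clean run does not rule the hypothesis out.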

Answer 12 · 2024-07-08T14:57:40.000Z

The trace provided in this issue (third comment for requests.exceptions.SSLError) seems to indicate it's occurring with Python as well:

https://answers.microsoft.com/en-us/msteams/forum/all/unabletoverifyleafsignature-traffic-manager-teams/3ad8a8cb-bb41-4c46-ae4e-e31d0688b06e

In which case it may have nothing to do with bot-framework, which seems likely given that these types of certificate issues generally originate on the server.

I have a support ticket open with Microsoft, however I've been informed that I don't have the correct tier of paid support to have anyone from engineering take a look.

If anyone has premium support I'd appreciate them raising the issue through a ticket.

Answer 13 · 2024-07-08T15:34:15.000Z

> I'm wondering what the HTTP response status code is. Is it being throttled? Though that is a specific status code and handled automatically. We can help mitigate by increasing our retries.

I think it's actually a success response code. Our 429/5XX logic wasn't catching this error. We had to add some custom logic to catch-and-retry it via matching on e.message.includes("unable to verify the first certificate"). As far as I can tell, the server believes the request is valid.
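
The message match described above could be factored into a small predicate. This is only a sketch: `isCertVerificationError` is a hypothetical name, and the extra `code` check relies on the `UNABLE_TO_VERIFY_LEAF_SIGNATURE` value that Node.js attaches to this error (another poster reports seeing it later in this thread).

```javascript
const CERT_MESSAGE = 'unable to verify the first certificate';
const CERT_CODE = 'UNABLE_TO_VERIFY_LEAF_SIGNATURE';

// Returns true when an error looks like the intermittent certificate failure.
// Both the message text and the error code are checked, on the assumption
// that wrapping errors may carry one but not the other.
function isCertVerificationError(err) {
  if (!err) return false;
  const message = typeof err.message === 'string' ? err.message : '';
  return err.code === CERT_CODE || message.includes(CERT_MESSAGE);
}
```

Matching on the code where available is less brittle than matching on message text, which can change between library versions.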

Answer 14 · 2024-07-08T16:27:49.000Z

Hi,

We started facing this error with our Teams App on 3/Jul/2024 and it has already impacted an important release. Before this, our App has been running fine since Dec'23. We are using botbuilder@4.20.0 on node.js@18.20.3.

The exception has the format: FetchError: request to <URL> failed, reason: unable to verify the first certificate

type: system
errno, code: UNABLE_TO_VERIFY_LEAF_SIGNATURE

The target is of the form: "https://smba.trafficmanager.net/amer/v3/conversations/.../activities/..."

We face this intermittently when our Teams app sends or updates an adaptive card using the sendActivity or updateActivity methods of the TurnContext object in the Bot Builder SDK for Node.js.

We were able to replicate this issue in non-production and usually see it occur for 10% of the send/update attempts. Handling the exception and retrying the activity works on the first attempt. I think you should be able to fix this by identifying and fixing the failing certificate chain on the concerned nodes of your distributed infrastructure.

Looking forward to a quick fix! Thanks.

Answer 15 · 2024-07-08T16:35:30.000Z

Are these all Teams bots? 4.21.0 is when JS SDK moved to MSAL auth (from ADAL). But 4.16 would still be ADAL of course.

Answer 16 · 2024-07-08T16:42:31.000Z

Ours is a Teams Bot.

Answer 17 · 2024-07-08T17:42:41.000Z

This is being actively investigated by the Teams group.

Answer 18 · 2024-07-08T18:04:03.000Z

Also experiencing this with a Teams bot running 4.21.1

Answer 19 · 2024-07-08T18:20:29.000Z

There have been at least two Sev 2s raised in that group. That is Microsoft terminology for a high-impact issue. It also means there will be eyes on it. I'll leave this open and post updates as I get them.

Answer 20 · 2024-07-08T21:16:30.000Z

Same issue with two of our applications.

Answer 21 · 2024-07-09T01:10:44.000Z

Hey team, do you still see the issues? Please let us know.

Answer 22 · 2024-07-09T02:06:47.000Z

> Hey team, do you still see the issues?

The last instance we observed was at 2024-07-08 04:00:17 pm PDT (~3 hours ago).

However, since it's intermittent and our app traffic volume is lower in the evening, we'll need to wait longer to be sure

Answer 23 · 2024-07-09T02:37:20.000Z

Thank you for sharing @at1as! For others, please do share whether you still see the issues or not with me here :)

One more question for the group: could you share the endpoint you are targeting, like @gitnavneet did?
I am mainly interested in the region that the endpoint contains.
ex: "https://smba.trafficmanager.net/**amer**/v3/conversations/.../activities/..."
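
For anyone pulling this from logs, the region segment can be extracted from the service URL programmatically. A small sketch, assuming the region is the first path segment as in the example URL above; `regionFromServiceUrl` is a hypothetical helper.

```javascript
// Extract the first path segment (for example "amer") from a service URL.
// Returns null when the URL has no path segments.
function regionFromServiceUrl(serviceUrl) {
  const { pathname } = new URL(serviceUrl);
  const [region] = pathname.split('/').filter(Boolean);
  return region || null;
}

console.log(regionFromServiceUrl('https://smba.trafficmanager.net/amer/v3/conversations/'));
// prints "amer"
```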

Answer 24 · 2024-07-09T02:41:04.000Z

Haven't seen the issue in the last few hours, but, will wait and see how things go tomorrow!

Our region endpoint was amer

Answer 25 · 2024-07-09T08:33:55.000Z

> Hey team, do you still see the issues?

Hey @YunnyChung,

We tried replicating the issue again in non-production today, however, it has not recurred! 😄

In summary,

We faced a total of 70 errors in production some of which caused a bad UX.
The first exception occurred on 3-Jul-24 at 21:56:45 UTC.
The last exception was ~12 hours ago on 8-Jul-24 at 20:35:26 UTC.
The target endpoint was always in the amer region.

We will continue monitoring the production logs for a few more days and report back here if the issue recurs.

🤔 Could you please share the root cause analysis? Thanks!

Answer 26 · 2024-07-09T12:00:56.000Z

Errors stopped for us as well.

Answer 27 · 2024-07-09T13:02:34.000Z

Thank you so much everyone for sharing the information here! Yes, please do let me know whether errors have stopped or are still occurring. We will share the root cause analysis when it is ready.

One more small favor to ask of this community:
For those who encountered this issue, are you using Linux? Could you share the OS of the machine you used when this issue occurred?

Answer 28 · 2024-07-09T14:07:42.000Z

> For those who encountered this issue, are you using Linux? Could you share the OS of the machine you used when this issue occurred?

Our solution is deployed as an Azure Function app running on Windows.

Answer 29 · 2024-07-10T13:32:51.000Z

> Yes, please do let me know whether errors have stopped or are still occurring.

I didn't see any errors yesterday, thanks for that! However, today we received a new one, probably related:

{
  "url": "https://smba.trafficmanager.net/amer/api/v2/[redacted]/channel/messages/[redacted]",
  "response": "read ECONNRESET"
}

Answer 30 · 2024-07-10T13:39:05.000Z

We saw this error once too, today at 6:15:57 UTC.

{
    "level": "error",
    "message": "An error occurred while handling the event FetchError: request to https://smba.trafficmanager.net/amer/v3/conversations/.../activities/... failed, reason: read ECONNRESET"
}

Answer 31 · 2024-07-10T13:52:38.000Z

> For those who encountered this issue, are you using Linux? Could you share the OS of the machine you used when this issue occurred?

Our bot is deployed as a service in Google Cloud Functions.

Answer 32 · 2024-07-10T13:55:22.000Z

@FBraz-RMFarma

> Our bot is deployed as a service in Google Cloud Functions

No kidding? That's pretty cool.

Answer 33 · 2024-07-10T13:59:46.000Z

I saw the ECONNRESET as well yesterday.

Answer 34 · 2024-07-10T14:06:18.000Z

> We saw this error once too, today at 6:15:57 UTC.

Yeah, we got 223 issues in the last hour 😞

Answer 35 · 2024-07-10T14:24:57.000Z

> Yeah, we got 223 issues in the last hour 😞

@avilabiel , now you have me worried! 😮 How did those 223 new issues impact your app or end users?

For our app, it caused an ephemeral error message, but the action succeeded with a retry.

Answer 36 · 2024-07-10T15:21:53.000Z

Thank you everyone for sharing the issue here 🙏🏻
For those who started to see the error, could you help me one more time by sharing the following information?

Is this issue still happening? If so, how often does it occur? Are all requests failing, or only some? If it is intermittent, roughly what share of requests fail (for example 1%, 5%, or 10%)?
When did you start to see the issue? (A date with approximate time and timezone would be super appreciated 🙏🏻)

If the issue stopped happening, could you share the last time you saw it? (Again, a date with approximate time would be super helpful here.)

Could you share the endpoint with the region information?
ex: https://smba.trafficmanager.net/amer/v3/conversations/.../activities/...
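
The figures requested above (first and last occurrence, failure rate) can be derived from request logs. A minimal sketch, assuming each log record has the hypothetical shape `{ timestamp, ok }` with an ISO 8601 timestamp; real log formats will differ and the function name is illustrative.

```javascript
// Summarize failures from request records of the assumed shape
// { timestamp: ISO 8601 string, ok: boolean }.
function summarizeFailures(records) {
  const failures = records
    .filter((record) => !record.ok)
    .map((record) => new Date(record.timestamp))
    .sort((a, b) => a - b);
  return {
    total: records.length,
    failed: failures.length,
    // Share of failed requests, as a percentage of all recorded requests.
    failureRatePercent: records.length ? (failures.length / records.length) * 100 : 0,
    firstSeen: failures.length ? failures[0].toISOString() : null,
    lastSeen: failures.length ? failures[failures.length - 1].toISOString() : null,
  };
}
```

Reporting timestamps in UTC (as `toISOString` does) avoids the timezone ambiguity the questions above ask about.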

Answer 37 · 2024-07-10T15:51:00.000Z

Latest: 2024-07-10 08:00:42 am PDT (50 mins ago)
First: 2024-07-09 08:00:03 pm PDT
Occurrences: 37

Our app is heavily skewed to triggering on the hour, for certain hours, so ~50 minutes of no occurrences does not tell us much. It is intermittent, but I don't have immediate counts.

We see it for this endpoint (though it's our most heavily used, so this isn't necessarily an exhaustive list):

POST https://smba.trafficmanager.net/amer/v3/conversations

Answer 38 · 2024-07-10T21:29:35.000Z

We are actively investigating and mitigating. Two quick follow-up questions:

(1) For the failed requests, do the responses contain an ms-cv header?
(2) Could you let me know if you see a decrease in errors starting 07/10/24 8:30 pm UTC?

Answer 39 · 2024-07-10T22:48:23.000Z

@YunnyChung ,

The read ECONNRESET error occurred multiple times for our app today (10/Jul/24). Some users missed important updates. So, we have postponed the app launch to a wider group.

(1) I did not notice any header with ms-cv in our logs.
(2) Yes, errors have decreased. They have not occurred since 19:05:49 UTC.

Some error timestamps (UTC): 6:15:57 (first), 14:16:51, 14:19:28, 15:21:01, 16:50:57, 19:05:49 (last).
Target: https://smba.trafficmanager.net/amer/v3/conversations/.../activities/...

Answer 40 · 2024-07-15T12:39:42.000Z

Hey guys,
please let me know if the problems have stopped?

Answer 41 · 2024-07-15T13:01:08.000Z

> Hey guys, please let me know if the problems have stopped?

What I've been seeing on our end is that our bot is now able to send a message to a channel, then reply to itself, and update its own messages. However, the bot is still not able to reply to user messages, either in private or in a channel.

Not working: User message > Reply
Working: User message > Create new message in channel > Reply to that new message > Edit the reply

Answer 42 · 2024-07-15T14:55:15.000Z

I have NOT seen any read ECONNRESET errors since 10/Jul/24, 19:05:49 UTC. 🎉

Answer 43 · 2024-07-17T12:45:14.000Z

Our bot is able to reply to users again. All of our issues are now resolved. 🎉