microsoft/botbuilder-js

Started receiving "unable to verify the first certificate" when interacting with the sdk

nodkrot opened this issue ยท 43 comments

Versions

What package version of the SDK are you using. (botbuilder@4.16.0)
What nodejs version are you using (v16.20.2)
What os are you using (Mac)

Describe the bug

On July 3rd we started receiving "unable to verify the first certificate" error when starting and using botbuilder sdk APIs. It appears to happen to some but not all customers.

To Reproduce

Instantiate and use any operation botbuilder sdk (send or update message for example)

Expected behavior

No error

"exception":{"message":"unable to verify the first certificate","stack":"Error: unable to verify the first certificate\n    at new RestError (/var/www/app/node_modules/@azure/ms-rest-js/lib/restError.ts:18:5)\n    at AxiosHttpClient.<anonymous> (/var/www/app/node_modules/@azure/ms-rest-js/lib/axiosHttpClient.ts:194:15)\n    at step (/var/www/app/node_modules/@azure/ms-rest-js/node_modules/tslib/tslib.js:141:27)\n    at Object.throw (/var/www/app/node_modules/@azure/ms-rest-js/node_modules/tslib/tslib.js:122:57)\n    at rejected (/var/www/app/node_modules/@azure/ms-rest-js/node_modules/tslib/tslib.js:113:69)\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)"},"level":"error","message":"Failure updating message: unable to verify the first certificate"

I'm getting the same error right now.

I'm getting the same error right now.

Getting the same error intermittently too!

we are receiving same error as well

We are seeing the same on our two apps as well.

The OP indicates 4.16, which is quite old. What about the others in this thread?

I'm on 4.22.2

Same issue for us.

First occurrence: 2024-07-03 03:37:56 pm PDT.

Since then 290 further occurrences.

For the time being we're catching this error and simply retrying each request up to 3x and virtually all succeed. However we'd prefer to not have to do this.

we are seeing issue on 4.22.3 as well ..

Since this is happening on older versions, it's probably not related to a recent change in SDK. I'll check for issues with the other end. I'm assuming its failing during the request, especially since retries make it work.

I'm wondering what the HTTP response status code is. Is it being throttled? Though that is a specific status code and handled automatically. We can help mitigate by increasing our retries.

Since this is happening on older versions, it's probably not related to a recent change in SDK. I'll check for issues with the other end. I'm assuming its failing during the request, especially since retries make it work.

I'm wondering what the HTTP response status code is. Is it being throttled? Though that is a specific status code and handled automatically. We can help mitigate by increasing our retries.

We receive 500 if I'm not mistaken. Since it's intermittent, it feels like there is a load balancer pointing to a server with an expired certificate. My guess

The trace provided in this issue (third comment for requests.exceptions.SSLError) seems to indicate it's occurring with Python as well:

In which case it may have nothing to do with bot-framework, which seems likely given these type of certificate issues generally originate on the server.

I have a support ticket open with Microsoft, however I've been informed that I don't have the correct tier of paid support to have anyone from engineering take a look.

If anyone has premium support I'd appreciate them raising the issue through a ticket.

I'm wondering what the HTTP response status code is. Is it being throttled? Though that is a specific status code and handled automatically. We can help mitigate by increasing our retries.

I think it's actually a success response code. Our 429/5XX logic wasn't catching this error. We had to add some custom logic to catch-and-retry it via matching on e.message.includes("unable to verify the first certificate"). As far as I can tell, the server believes the request is valid.

Hi,

We started facing this error with our Teams App on 3/Jul/2024 and it has already impacted an important release. Before this, our App has been running fine since Dec'23. We are using botbuilder@4.20.0 on node.js@18.20.3.

The exception has the format: FetchError: request to <URL> failed, reason: unable to verify the first certificate

  • type: system
  • errno, code: UNABLE_TO_VERIFY_LEAF_SIGNATURE

The target is of the form: "https://smba.trafficmanager.net/amer/v3/conversations/.../activities/..."

We face this intermittently when our Teams app sends or updates an adaptive card using the sendActivity or updateActivity methods of the TurnContext object in the Bot Builder SDK for Node.js.

We were able to replicate this issue in non-production and usually see it occur for 10% of the send/update attempts. Handling the exception and retrying the activity works on the first attempt. I think you should be able to fix this by identifying and fixing the failing certificate chain on the concerned nodes of your distributed infrastructure.

Looking forward to a quick fix! Thanks.

Are these all Teams bots? 4.21.0 is when JS SDK moved to MSAL auth (from ADAL). But 4.16 would still be ADAL of course.

Ours is a Teams Bot.

This is being actively investigated by the Teams group.

Also experiencing this with a teams bot running 4.21.1

There has been at least two Sev 2's raised in that group. That is Microsoft terminology for high impact issue. It also means there will be eyes on it. I'll leave this open and post updates as I get them.

Same issue with two of our application.

Hey team, do you still see the issues? Could you please let us know if you still see the issues?

Hey team, do you still see the issues? Could you please let us know if you still see the issues?

The last instance we observed was at 2024-07-08 04:00:17 pm PDT (~3 hours ago).

However, since it's intermittent and our app traffic volume is lower in the evening, we'll need to wait longer to be sure

Thank you for sharing @at1as! For others, please do share whether you still see the issues or not with me here :)

One more question to the group: could you share your end point you are targeting like @gitnavneet did?
I am mainly interested in the region that the end point contains.
ex: "https://smba.trafficmanager.net/**amer**/v3/conversations/.../activities/..."

Hi,

We started facing this error with our Teams App on 3/Jul/2024 and it has already impacted an important release. Before this, our App has been running fine since Dec'23. We are using botbuilder@4.20.0 on node.js@18.20.3.

The exception has the format: FetchError: request to failed, reason: unable to verify the first certificate

  • type: system
  • errno, code: UNABLE_TO_VERIFY_LEAF_SIGNATURE

The target is of the form: "https://smba.trafficmanager.net/amer/v3/conversations/.../activities/..."

We face this intermittently when our Teams app sends or updates an adaptive card using the sendActivity or updateActivity methods of the TurnContext object in the Bot Builder SDK for Node.js.

We were able to replicate this issue in non-production and usually see it occur for 10% of the send/update attempts. Handling the exception and retrying the activity works on the first attempt. I think you should be able to fix this by identifying and fixing the failing certificate chain on the concerned nodes of your distributed infrastructure.

Looking forward to a quick fix! Thanks.

Haven't seen the issue in the last few hours, but, will wait and see how things go tomorrow!

Our region endpoint was amer

Hey team, do you still see the issues? Could you please let us know if you still see the issues?

Hey @YunnyChung,

We tried replicating the issue again in non-production today, however, it has not recurred! ๐Ÿ˜„

In summary,

  • We faced a total of 70 errors in production some of which caused a bad UX.
  • The first exception occurred on 3-Jul-24 at 21:56:45 UTC.
  • The last exception was ~12 hours ago on 8-Jul-24 at 20:35:26 UTC.
  • The target endpoint was always in the amer region.

We will continue monitoring the production logs for a few more days and report back here if the issue recurs.

๐Ÿค” Could you please share the root cause analysis? Thanks!

Errors stopped for as well
image

Thank you so much everyone for sharing the information here! Yes, please do let me know whether errors have stopped or still occurring. We will share the root cause analysis when it is ready.

One another small favor I want to ask to this community:
For those who encountered this issue, are you using Linux? Could you share the OS of the machine you used when this issue occurred?

Thank you so much everyone for sharing the information here! Yes, please do let me know whether errors have stopped or still occurring. We will share the root cause analysis when it is ready.

One another small favor I want to ask to this community: For those who encountered this issue, are you using Linux? Could you share the OS of the machine you used when this issue occurred?

Our solution is deployed as an Azure Function app running on Windows.

Thank you so much everyone for sharing the information here! Yes, please do let me know whether errors have stopped or still occurring. We will share the root cause analysis when it is ready.

One another small favor I want to ask to this community: For those who encountered this issue, are you using Linux? Could you share the OS of the machine you used when this issue occurred?

I didn't see any errors yesterday, thanks for that! However, today we received a new one, probably related:

{
  "url": "https://smba.trafficmanager.net/amer/api/v2/[redacted]/channel/messages/[redacted]"
  "response": "read ECONNRESET"
}

We saw this error once too, today at 6:15:57 UTC.

{
    "level": "error",
    "message": "An error occurred while handling the event FetchError: request to https://smba.trafficmanager.net/amer/v3/conversations/.../activities/... failed, reason: read ECONNRESET"
}

Thank you so much everyone for sharing the information here! Yes, please do let me know whether errors have stopped or still occurring. We will share the root cause analysis when it is ready.

One another small favor I want to ask to this community: For those who encountered this issue, are you using Linux? Could you share the OS of the machine you used when this issue occurred?

Our bot is deployed as an service in Google Cloud Functions

@FBraz-RMFarma

Our bot is deployed as an service in Google Cloud Functions

No kidding? That's pretty cool.

I saw the ECONNRESET as well yesterday.

We saw this error once too, today at 6:15:57 UTC.

{
    "level": "error",
    "message": "An error occurred while handling the event FetchError: request to https://smba.trafficmanager.net/amer/v3/conversations/.../activities/... failed, reason: read ECONNRESET"
}

Yeah, we got 223 issues in the last hour ๐Ÿ˜ž

Yeah, we got 223 issues in the last hour ๐Ÿ˜ž

@avilabiel , now you have me worried! ๐Ÿ˜ฎ How did those 223 new issues impact your app or end users?

For our app, it caused an ephemeral error message, but the action succeeded with a retry.

Thank you everyone for sharing the issue here ๐Ÿ™๐Ÿป
For those who started to see the error, could you help me one more time by sharing the following information?

  1. Is this issue still happening? If so, how often does this issue occur? (ex: All requests are failing or some? If this happens intermittently, how often? (ex: 1%, 5%, 10% of the requests etc))
  2. When did you start to see the issue? (date with approximate time (with timezone) will be super appreciated ๐Ÿ™๐Ÿป)
  • If the issue stopped happening, could you share the last time you saw the issue? (again date with approximate time will be super helpful here)
  1. Could you share the end point with the region information?
    ex: https://smba.trafficmanager.net/amer/v3/conversations/.../activities/...
Screenshot 2024-07-10 at 8 37 28โ€ฏAM

Latest: 2024-07-10 08:00:42 am PDT (50 mins ago)
First: 2024-07-09 08:00:03 pm PDT
Occurrences: 37

Our app is heavily skewed to triggering on the hour, for certain hours, so ~50 minutes of no occurrences does not tell us much. It is intermittent, but I don't have immediate counts.

We see it for this endpoint (though it's our most heavily used, so this isn't necessarily an exhaustive list):

https://smba.trafficmanager.net/amer/v3/conversations", "method":"POST"

We are in active process for investigation & mitigation -quick two follow up questions here:

(1) For those failed requests, do you see those requests' responses containing header with the ms-cv?
(2) Could you let me now if you see the decrease of the errors starting 07/10/24 8:30 pm in UTC?

@YunnyChung ,

The read ECONNRESET error occurred multiple times for our app today (10/Jul/24). Some users missed important updates. So, we have postponed the app launch to a wider group.

(1) I did not notice any header with ms-cv in our logs.
(2) Yes, errors have decreased. They have not occurred since 19:05:49 UTC.

Some error timestamps (UTC): 6:15:57 (first), 14:16:51, 14:19:28, 15:21:01, 16:50:57, 19:05:49 (last).
Target: https://smba.trafficmanager.net/amer/v3/conversations/.../activites/...

Hey guys,
please let me know if the problems have stopped?

Hey guys, please let me know if the problems have stopped?

What I've been seeing on our end, is that our bot is now able to send a message to a channel, then reply to iself, and update its own messages. However, the bot is still not able to reply to user messages, both in private or in a channel.

Not working: User messsage > Reply
Working: User message > Create new message in channel > Reply to that new message > Edit the reply

I have NOT seen any read ECONNRESET errors since 10/Jul/24, 19:05:49 UTC. ๐ŸŽ‰

Our bot is able to reply to users again. All of our issues are now resolved. ๐ŸŽ‰