jstedfast/MailKit

SmtpClient Exception Behavior Change in 4.4.0

nitinag opened this issue ยท 26 comments

It looks like SmtpClient had a change in 4.4.0 with its exception behavior on SendAsync. We started to get significantly more SmtpProtocolExceptions with the "unexpectedly disconnected" error message. Prior to 4.4.0, we rarely got SmtpProtocolException and they were mostly all SmtpCommandException.

It's definitely useful to know that the server has disconnected, which wasn't available before so we can better decide on reuse of the SmtpClient. However, we've now lost access to the StatusCode property that was on SmtpCommandException since many of those exceptions are now being reported as SmtpProtocolException in 4.4.0+. 4.5.0 added the server response message to SmtpProtocolException, which is definitely helpful, but still no StatusCode as before, which is incredibly important to know so we can decide what to do with the message based on the server's response.

  • Can a nullable StatusCode property be added to SmtpProtocolException similar to SmtpCommandException? It'd have to be nullable since it's not present in every case of SmtpProtocolException since the server could disconnect without a status code or a response.

  • Can a ReplyText or similar property be added to both SmtpCommandException and SmtpProtocolException (nullable ReplyText) so that we can access the raw response from the server. Right now we have to resort to parsing the exception message and taking out the "unexpectedly disconnected" (since 4.4.0+) and/or other extra added wording. It creates a task to follow/update future prefixed wording changes, etc. It'd be much better to have a property to access that. We need to reliably parse the enhanced status code from the response in all cases (success/error) and check the text to make decisions, and for troubleshooting. We've been able to easily get the success case since SendAsync returns that as of #1161.

Part of this behavior change (useful to know server disconnected) also includes this exception for successful delivery in some cases (no way to know what the status code was, but guessing it was 2xx based on the response, so we check for that now too):
"MailKit.Net.Smtp.SmtpProtocolException: The SMTP server has unexpectedly disconnected: 2.1.0 OK xxxxxxxxxxxxxxxxxxxxxxxxxxxxx - gsmtp"

Hmmmm...

We started to get significantly more SmtpProtocolExceptions with the "unexpectedly disconnected" error message. Prior to 4.4.0, we rarely got SmtpProtocolException and they were mostly all SmtpCommandException.

This sounds like a bug. If a command is failing, it should throw SmtpCommandException and it should never get to a situation where it throws "unexpected disconnect" SmtpProtocolExceptions.

That said, for exceptions like MailKit.Net.Smtp.SmtpProtocolException: The SMTP server has unexpectedly disconnected: 2.1.0 OK xxxxxxxxxxxxxxxxxxxxxxxxxxxxx - gsmtp, I don't think we have any other option than to do what you propose regarding the StatusCode? and ResponseText? properties on SmtpProtocolException.

However, we've now lost access to the StatusCode property that was on SmtpCommandException since many of those exceptions are now being reported as SmtpProtocolException in 4.4.0+

I'd really like to know more about this because this sounds bad if those used to be SmtpCommandExceptions.

The way we caught this is when switching from 4.3.0 -> 4.5.0, our outbound queue started to build up significantly. After some digging into our logs, I was able to confirm this exception change behavior since the root cause of the queue build up was that we didn't know the status code of the new volume of SmtpProtocolExceptions. We had a much lower volume of these prior to 4.4.0. There was no response passed as part of the SmtpProtocolException prior to 4.5.0 so our older logs don't have any additional details on that exception for the old logs, just MailKit.Net.Smtp.SmtpProtocolException: The SMTP server has unexpectedly disconnected. for that time.

My working theory was that the changes/fixes made in 4.4.0 caused better detection of smtp server disconnects and thus caused the switchover of many SmtpCommandException to SmtpProtocolException for many of these responses. It's not uncommon for a smtp server to disconnect after accepting/rejecting a message. It could also be a different bug causing it to think its disconnected, but it definitely started in 4.4.0. We started getting a more balanced amount of both of these exceptions then while with 4.3.0 they were mostly SmtpCommandException. For now, we've reverted to 4.3.0 since we need the status code on failures to get them off our queue.

While you're looking at this:
ResponseText and StatusCode properties would be also useful for ServiceNotAuthenticatedException in addition to SmtpProtocolException and SmtpProtocolException so that all the error cases are covered to get these details. The success case (SendAsync) doesn't have StatusCode, but is always SmtpStatusCode.Ok (250) based on the code so that's easy enough to document.

Okay, so there are essentially 5 locations where an SmtpProtocolException gets thrown: 3 of them are in the response parsing logic (but none of these say "unexpectedly disconnected") and the last 2 are in the ReadAhead() and ReadAheadAsync() logic which is where the "unexpectedly disconnected" error is used.

That probably means that the server really did unexpectedly disconnect and, as you have suggested, it is probably a sign that MailKit's SmtpClient stopped swallowing these like it had been previously (and I think this might be evidenced by the code changes between 4.3.0 and 4.4.0 as well).

These are the changes that likely played a part in this behavior change:

commit 0a5029d
Author: Jeffrey Stedfast jestedfa@microsoft.com
Date: Mon Jan 15 12:58:55 2024 -0500

Refactored SmtpClient to share disconnect-on-error logic

commit 4310535
Author: Jeffrey Stedfast jestedfa@microsoft.com
Date: Mon Jan 15 12:29:20 2024 -0500

Fix SmtpCLient's re-EHLO logic to disconnect on errors

Also fixed the protected SendCommand() method to disconnect on errors.

commit 3d1487d
Author: Jeffrey Stedfast jestedfa@microsoft.com
Date: Mon Jan 15 12:27:57 2024 -0500

Fixed previous commit to only disconnect when the Exception isn't AuthenticationException

commit d5770f9
Author: Jeffrey Stedfast jestedfa@microsoft.com
Date: Mon Jan 15 10:19:34 2024 -0500

Fixed SmtpClient to disconnect during Authenticate/Async on socket errors

For ServiceNotAuthenticatedException, the StatusCode would always 530 and the ResponseText would be the same as the Message string.

Yes, that's how we handle ServiceNotAuthenticatedException right now and match that based on that assumption from the current code, but it would be more explicit for all three exceptions to use the same properties so it prevents breaking that assumption in future code changes and one can use consistent properties to access those values.

I might have to add an SmtpServiceNotAuthenticatedException to add those properties because they won't map as easily for POP3 and IMAP.

Question: When you get MailKit.Net.Smtp.SmtpProtocolException: The SMTP server has unexpectedly disconnected: 2.1.0 OK xxxxxxxxxxxxxxxxxxxxxxxxxxxxx - gsmtp, is it when calling a NoOpAsync() or something?

It happens on SmtpClient.SendAsync.

at MailKit.Net.Smtp.SmtpClient.SendAsync(FormatOptions options, MimeMessage message, MailboxAddress sender, IList`1 recipients, CancellationToken cancellationToken, ITransferProgress progress)

That ... doesn't make sense unless there's a bug in the SendAsync logic. I don't understand how that could happen :(

That response text suggests that this is happening after the SMTP server sends 250 2.1.0 OK xxxxxxxxxxxxxxxxxxxxxxxxxxxxx - gsmtp which would indicate success. If that gets a successful response, though, it would not (or at least should not) attempt any other commands to the server which means it couldn't get an SmtpProtocolException.

I'm so lost as to what could be happening there.

We're assuming its a 250 status code. Right now we don't know the actual status code. Since SendAsync also calls MailFrom, RcptTo, etc., is it possible it's from one of those? I guess right now, those exceptions also don't let us know which smtp action caused the exception. Maybe a property should indicate that as well on the exceptions.

Given we're getting that exception mostly on 4xx/5xx, I thought what was happening is that it was a 250 response and the message accepted, but the exception was raised because SmtpClient detected the server did disconnect after accepting.

No, it would only come as a response to sending the message content. That xxxxxxx... is a "tracking id".

Do you have a MessageSent event handler? If it throws an exception, that looks like it'd get caught in SendAsync() causing SmtpClient to send the RSET command, but if the server disconnected after the send, then this would fail with a SmtpProtocolException "server unexpectedly disconnected" followed by the last response that the client received (which would be the 2.1.0 OK xxxxxxxxxxxxxxxxxxxxxxxxxxxxx - gsmtp string).

Yes, I did have a MessageSent event handler to capture args.Response until last night when I took it out to replace it with the response from SendAsync.

Actually, my analysis is wrong I think, because it only calls RSET if it gets an SmtpCommandException or a ServiceNotAuthenticatedException, but then it will throw the original exception (or should).

I think I might need a log so I can better understand what is happening.

Most of the SmtpProtocolExceptions we're getting are with 4.x.x and 5.x.x. The 2.x.x SmtpProtocolException are only a handful every day. Given our smtp volume, it would be difficult to enable/provide a full protocol log. I also don't have a specific case to easily reproduce, we can just see it in our overall logs of each smtp session. If there was a runtime property/details that would be helpful to you, I can easily add logging for that and monitor.

Are you saying that its not expected for SmtpClient to throw SmtpProtocolException if the server disconnects after responding with 250 2.1.0 OK? I thought it seemed consistent with what was happening for the other "server unexpectedly disconnected" 4.x.x and 5.x.x with the changes in 4.4.0.

Also, we do get a "tracking id" for every server response from gsmtp regardless the command. It's the same id that's appended to all responses for that session/transaction (at least for gsmtp).

We have commonly seen gsmtp respond with 250 2.1.0 OK xxxxxxxxxxxxxxxxxxxxxxxxxxxxx - gsmtp in response to a MAIL FROM. Maybe it could/is likely that? A property identifying the smtp command the exception was raised from would probably be helpful to identify which part of the smtp transaction failed or disconnected since that could mean different actions need to be taken.

Edit: Confirmed, this is likely in response to MAIL FROM. Even our mail server sends 250 2.1.0 OK in that case and we have internal docs with examples from other mail servers sending that enhanced response code for MAIL FROM responses.

Most of the SmtpProtocolExceptions we're getting are with 4.x.x and 5.x.x.

That's what I would expect and gives me some measure of relief :-)

The 2.x.x SmtpProtocolException are only a handful every day.

That's a relief that it's a small number, but I'm still baffled by it because the 2.x.x would insinuate success.

Are you saying that its not expected for SmtpClient to throw SmtpProtocolException if the server disconnects after responding with 250 2.1.0 OK?

Let me try to explain my thinking more clearly. What I'm saying is that the SmtpClient, after receiving a 250 ... response to the message data (second part of the DATA command sequence), should not have a need to send any other command and so it would have no way of discovering that the server has unexpectedly disconnected.

The "unexpectedly disconnected" error ONLY happens if/when the client attempts to read a response and gets an EOF indicating the remote server has closed the connection.

This cannot happen as part of reading a full response because SMTP responses are structured such that the client can figure out if it needs to read more data or not without having to "try-and-fail". E.g. responses are line-based and the status code is followed by - if there is another line (or more) or <SPACE> if it is the last line of the response.

All that said, it looks like my assumptions about which command was getting this response was incorrect. It looks like it was a MAIL FROM response rather than a message data response.

Also, we do get a "tracking id" for every server response from gsmtp regardless the command. It's the same id that's appended to all responses for that session/transaction.

Yea, and I think I jumped to the wrong conclusion earlier. I think what that "tracking code" really is is a Session ID or Correlation ID that Google likely uses for looking up session telemetry data as opposed to a "message tracking id".

Also, we do get a "tracking id" for every server response from gsmtp regardless the command. It's the same id that's appended to all responses for that session/transaction (at least for gsmtp).

Yep.

We have commonly seen gsmtp respond with 250 2.1.0 OK xxxxxxxxxxxxxxxxxxxxxxxxxxxxx - gsmtp in response to a MAIL FROM. Maybe it could/is likely that?

This makes WAY more sense.

According to https://www.rfc-editor.org/rfc/rfc3463#section-3.2 ... the 2.1.0 suggests successful enhanced status code (2.x.x) but the sender(?) address has some kind of issue(?) (x.1.0 Other address status).

If the response was 100% ok, it should be 2.0.0 instead of 2.1.0.

If this is in response to a MAIL FROM command, then this would seem to jive with the Enhanced Status Code info.

It also might be why you are getting disconnected. If the address has an issue, GMail might be deciding to drop the connection and maybe that issue is an authentication issue or maybe it's a temporary throttling issue.

A property identifying the smtp command the exception was raised from would probably be helpful to identify which part of the smtp transaction failed or disconnected since that could mean different actions need to be taken.

Right.

The "unexpectedly disconnected" error ONLY happens if/when the client attempts to read a response and gets an EOF indicating the remote server has closed the connection.

That's what I had thought as well and was surprised with the new SmtpProtocolExceptions included disconnect detection on DATA/BDAT responses.

It's does seem to be throwing in response to DATA/BDAT in many cases since our logs contain plenty of SmtpProtocolExceptions referencing the content of the message, DKIM did not pass, etc.

Maybe it is a bug somewhere in that case because I'm assuming you wouldn't generally detect the disconnect until the read of the next command after a DATA/BDAT response has already been processed?

Here is the stack trace for one of those:
at MailKit.Net.Smtp.SmtpStream.ReadAheadAsync(CancellationToken cancellationToken) at MailKit.Net.Smtp.SmtpStream.ReadResponseAsync(CancellationToken cancellationToken) at MailKit.Net.Smtp.SmtpStream.SendCommandAsync(String command, CancellationToken cancellationToken) at MailKit.Net.Smtp.SmtpClient.SendCommandInternalAsync(String command, CancellationToken cancellationToken) at MailKit.Net.Smtp.SmtpClient.ResetAsync(CancellationToken cancellationToken) at MailKit.Net.Smtp.SmtpClient.SendAsync(FormatOptions options, MimeMessage message, MailboxAddress sender, IList`1 recipients, CancellationToken cancellationToken, ITransferProgress progress)

because I'm assuming you wouldn't generally detect the disconnect until the read of the next command after a DATA/BDAT response has already been processed?

Correct, that is the way things are expected to work which means there may be a bug.

In your sample stacktrace, it looks like something is causing one of the "SendAsync" commands to fail (MAIL FROM, RCPT TO, or DATA/BDAT) in the form of an SmtpCommandException (or ServerNotAuthenticatedException, I suppose) which then causes RSET to be sent and based on the StackTrace you provided, might suggest that this is then causing an SmtpProtocolException to be thrown.

One reason you may be getting more SmtpProtocolExceptions now is that perhaps RSET exceptions are not being caught anymore but were before(?).

Most of the SmtpProtocolException have gone away once we switched back to 4.3.0 and the very few present after that point don't show ResetAsync in the stack trace.

Okay, that should improve the situation a bit. Hopefully.

Just checked out v4.3.0 and indeed, the old 4.3.0 code used to swallow RSET exceptions.

So, it seems in most error cases MailKit will know that the server disconnected due to the attempt to RSET after a bad DATA/BDAT 4xx/5xx response or bad MAIL FROM/RCPT TO response. In those cases, could we expect SmtpClient.IsConnected (or some other way) to report accurately (since MailKit already knows) so we can still detect/log that the server has disconnected while processing the response for our connection management logic?

The StatusCode, ResponseText, and ~SourceSmtpCommand properties would still be very useful to have for all three of the exceptions so we can log and take action on exactly what is happening and where.

So, it seems in most cases MailKit will know that the server disconnected due to the attempt to RSET after a bad DATA/BDAT 4xx/5xx response or bad MAIL FROM/RCPT TO response. In those cases, could we expect SmtpClient.IsConnected (or some other way) to report accurately (since MailKit already knows) so we can still detect/log that the server has disconnected while processing the response for our connection management logic?

I would never depend fully on this being accurate if IsConnected=True, but if IsConnected=False, then yes.

Keep in mind that if the DATA/BDAT command is successful and then the server disconnects, SmtpClielt.IsConnected will still be True until you attempt to send the next message or call NoOpAsync().

I released MailKit 4.6.0 with the fixes I have for this so far which restores behavior back to what it was pre-4.4.0