bbottema/simple-java-mail

Catastrophic backtracking in validation regexes

GoogleCodeExporter opened this issue · 7 comments

There are some email addresses that behave *very* poorly with the validation 
done in EmailValidationUtil.  I think it might be due to the nested quantifiers 
in the complex regexes there.  They literally take hours to finish the 
validation, using 100% CPU.
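
To make the cost visible without waiting for hours, here is a minimal sketch that counts how many characters the regex engine reads during one match attempt. The pattern `(a+a+)+b` is a textbook nested-quantifier example chosen for illustration, not one of the regexes in EmailValidationUtil, and `CountingSequence` and `BacktrackingDemo` are hypothetical names. How quickly the count grows depends on the pattern and on the JDK version.

```java
import java.util.regex.Pattern;

/** Wraps a CharSequence and counts how often the regex engine reads a character. */
final class CountingSequence implements CharSequence {
    private final CharSequence inner;
    long reads = 0;

    CountingSequence(CharSequence inner) {
        this.inner = inner;
    }

    @Override public char charAt(int index) {
        reads++; // every character inspection by the matcher passes through here
        return inner.charAt(index);
    }

    @Override public int length() {
        return inner.length();
    }

    @Override public CharSequence subSequence(int start, int end) {
        return new CountingSequence(inner.subSequence(start, end));
    }

    @Override public String toString() {
        return inner.toString();
    }
}

final class BacktrackingDemo {
    // Illustrative only: a quantified group whose body can split the same text in many ways.
    static final Pattern NESTED = Pattern.compile("(a+a+)+b");

    /** Returns the number of character reads needed to reject a run of n 'a' characters. */
    static long readsFor(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        CountingSequence input = new CountingSequence(sb);
        NESTED.matcher(input).matches(); // never matches: the input contains no 'b'
        return input.reads;
    }

    public static void main(String[] args) {
        for (int n = 4; n <= 16; n += 4) {
            System.out.println(n + " chars -> " + readsFor(n) + " character reads");
        }
    }
}
```

Comparing the read counts for a few input lengths gives a rough idea of whether a given pattern is at risk, without having to run a pathological input to completion.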

Is there any way to fix this, and barring that, can an option be added to skip 
validation?

To reproduce:
1. Try to send an email to an address like 
309d4696df38ff12c023600e3bc2bd4b@fakedomain.com
2. Wait for computer to explode

(Using Java 1.6.0_31)

Original issue reported on code.google.com by semico...@gmail.com on 20 Apr 2012 at 10:13

The regular expressions come from another open source project, 
http://code.google.com/p/emailaddress/source/browse/.

The problem is that the class has exploded over there, so I would have to patch 
our own version. Alas, regular expressions are not my specialty. Any thoughts on 
how to fix this?

Original comment by b.bottema on 9 Aug 2012 at 7:36
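
One mitigation that does not require rewriting the regexes is to put a time budget on the match. The sketch below is an assumption about how that could look, not something Simple Java Mail or the emailaddress project provides: it wraps the input in a `CharSequence` whose `charAt` throws once a deadline has passed, which aborts the matcher from the inside. `DeadlineSequence`, `GuardedMatcher`, and `matchesWithin` are hypothetical names.

```java
import java.util.regex.Pattern;

/** CharSequence that makes regex matching fail fast once a deadline has passed. */
final class DeadlineSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos;

    DeadlineSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    @Override public char charAt(int index) {
        // Checked on every read; this adds overhead but bounds the worst case.
        if (System.nanoTime() - deadlineNanos >= 0) {
            throw new IllegalStateException("regex matching exceeded its time budget");
        }
        return inner.charAt(index);
    }

    @Override public int length() {
        return inner.length();
    }

    @Override public CharSequence subSequence(int start, int end) {
        return new DeadlineSequence(inner.subSequence(start, end), deadlineNanos);
    }

    @Override public String toString() {
        return inner.toString();
    }
}

final class GuardedMatcher {
    /**
     * Returns the result of a full match if it completes in time;
     * throws IllegalStateException if the budget runs out first.
     */
    static boolean matchesWithin(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        return pattern.matcher(new DeadlineSequence(input, deadline)).matches();
    }

    public static void main(String[] args) {
        Pattern simple = Pattern.compile("[^@]+@[^@]+");
        System.out.println(matchesWithin(simple, "someone@example.com", 1000));
    }
}
```

The exception type and the budget are arbitrary choices here. A caller could treat a timeout as "skip validation and let the SMTP server decide" or as a rejection, depending on which risk matters more.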

seanf commented

It looks like the email validator might move here (eventually): https://github.com/lhazlewood/jeav

Hmm, not anytime soon I'm afraid, considering the last commit was from 2011 and the validation logic was initially created in 2006. I see you're trying to get that to move. Good. Let's see.

seanf commented

> Hmm, not anytime soon I'm afraid, considering the last commit was from 2011 and the validation logic was initially created in 2006. I see you're trying to get that to move. Good. Let's see.

True, I just thought I'd put in the link in case it's hard to find later on (in the hope that the migration does in fact happen at some point).

@bbottema What exactly did you mean by "the class has exploded" in #3 (comment)? (assuming you can remember what you meant back that far!) Perhaps just the mess of untested regexes? Since the code seems to be unmaintained anyway, it might be worth bringing it in, assuming it's worth keeping at all.

It looks like the performance problems were known even before the validator project was set up, and not likely to be fixed any time soon: http://leshazlewood.com/2006/11/06/emailaddress-java-class/comment-page-1/#comment_count

In any case, there's a school of thought that client code shouldn't even try to validate email addresses perfectly, for instance http://davidcel.is/posts/stop-validating-email-addresses-with-regex/. It's too easy to get it wrong and reject email addresses which are actually valid, before you even start worrying about things like catastrophic backtracking.

It's probably enough to check for @ or <somename> something@something, and let the SMTP server worry about anything more sophisticated if need be. In view of the potential for erroneous rejections or performance problems and the general unmaintained and untested state of the validation library, I would suggest that Simple Java Mail shouldn't bother, or should at least make the address validation optional. (Or did I miss an option which already exists?)
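
A minimal sketch of such a lenient check, assuming the only goal is to catch obviously broken input before handing the address to the SMTP server. It uses plain string operations, so it runs in time linear in the input length and cannot backtrack. `LenientAddressCheck` and `looksLikeAddress` are hypothetical names, not part of Simple Java Mail.

```java
final class LenientAddressCheck {
    /**
     * Accepts "something@something" and "Some Name <something@something>".
     * Deliberately permissive: anything more precise is left to the SMTP server.
     */
    static boolean looksLikeAddress(String raw) {
        if (raw == null) {
            return false;
        }
        String s = raw.trim();
        // For the "Some Name <local@domain>" form, only inspect the part in angle brackets.
        if (s.endsWith(">")) {
            int open = s.lastIndexOf('<');
            if (open < 0) {
                return false;
            }
            s = s.substring(open + 1, s.length() - 1).trim();
        }
        // Require at least one character on each side of the last '@'.
        int at = s.lastIndexOf('@');
        return at > 0 && at < s.length() - 1;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeAddress("309d4696df38ff12c023600e3bc2bd4b@fakedomain.com"));
        System.out.println(looksLikeAddress("Some Name <user@example.com>"));
        System.out.println(looksLikeAddress("no-at-sign"));
    }
}
```

This will let some non-compliant addresses through, which is the trade-off being argued for: a possible rejection by the SMTP server rather than a false rejection or a runaway regex on the client.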

Some users may care about RFC-compliant bodies, but not about 100% RFC compliant addresses. Personally, I would prefer a risk of letting a non-compliant address through (perhaps to be rejected by the SMTP server) over the risk of wasting hours of CPU.

> What exactly did you mean by "the class has exploded" in #3 (comment)? (assuming you can remember what you meant back that far!) Perhaps just the mess of untested regexes? Since the code seems to be unmaintained anyway, it might be worth bringing it in, assuming it's worth keeping at all.

@seanf I think at the time it was impossible to easily update to a newer version of the validation library, as their version had diverged too much. I think they were halfway through switching from an overgrown regex version to doing it in native Java without regex.

> In any case, there's a school of thought that client code shouldn't even try to validate email addresses perfectly, for instance http://davidcel.is/posts/stop-validating-email-addresses-with-regex/. It's too easy to get it wrong and reject email addresses which are actually valid, before you even start worrying about things like catastrophic backtracking.

This is a valid point. The point was to provide early detection and friendly errors. It's a balance between helping the end user with an abstract, friendly API layer and letting them deal with the technical depths of native libraries and errors. It's why simple-java-mail exists in the first place, but maybe we should draw the line at email validation, simply because we haven't mastered the subject and it is completely untested.

> I would suggest that Simple Java Mail shouldn't bother, or should at least make the address validation optional. (Or did I miss an option which already exists?)

Currently not, but it is worth adding. There is a way to configure the validation criteria, by simply setting them on a Mailer instance. I will look into it.

> Some users may care about RFC-compliant bodies, but not about 100% RFC compliant addresses. Personally, I would prefer a risk of letting a non-compliant address through (perhaps to be rejected by the SMTP server) over the risk of wasting hours of CPU.

I agree, the purpose of this library is to provide an easy way to handle complex mail bodies that behave consistently across the many email readers. Email validation is secondary to that. However, if there is a good library out there that properly validates email, I would still like a facility like that.