skeggse/isemail

Support unicode email addresses

Closed this issue · 34 comments

Just processed a few hundred thousand emails and ran into a number of Unicode addresses that were failing validation, e.g. êjness@something.com

Ah, good to know this applies to someone. I'll take a look. Do you think you could submit a PR that adds some useful test cases for Unicode?

What's the Unicode character range we are aiming to check as valid?
The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
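One wrinkle worth noting for any range check: charCodeAt() works in UTF-16 code units, so characters above U+FFFF come back as surrogate halves rather than real code points. A minimal sketch of the difference (the sample characters are just illustrative):

```javascript
// charCodeAt() yields UTF-16 code units, so characters outside the Basic
// Multilingual Plane appear as two surrogate code units. codePointAt()
// returns the full code point, which is what a Unicode range check needs.
const local = 'ê';    // U+00EA, inside the BMP
const astral = '😀';  // U+1F600, outside the BMP

console.log(local.charCodeAt(0));    // 234
console.log(astral.charCodeAt(0));   // 55357 (high surrogate, not the character)
console.log(astral.codePointAt(0));  // 128512 (0x1F600, the real code point)
```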

I haven't had a chance to read and digest the existing RFCs pertaining to unicode in email addresses. My intention is to follow the standard, which I hope will include the full unicode range. Sorry this hasn't moved forward, my school year finally ended, so I hope to have some time to work on this.

Re: RFC citations and full implementation -
Here are our findings that we believe should inform the work. Please let me know what's missing or not lining up with your expectations.

RFC 6530

  • Internationalized to utilize the full range of Unicode characters.
  • Unicode C0 and C1 control characters are prohibited by RFC 5321 and RFC 5198.
  • References v6 of the Unicode standard.
  • Permits the use of UTF-8 encoded strings in email addresses, both local parts and domain names.
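The C0/C1 prohibition above maps to two small code point ranges, so it can be enforced with a simple range test. A hedged sketch (not isemail's actual implementation):

```javascript
// Illustrative sketch (not isemail's real code): reject any C0 control
// character (U+0000-U+001F, plus U+007F DEL) or C1 control character
// (U+0080-U+009F), per the RFC 5321 / RFC 5198 restrictions cited above.
function hasProhibitedControls(str) {
  for (const ch of str) {        // for...of iterates by code point, not code unit
    const cp = ch.codePointAt(0);
    if (cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F)) {
      return true;
    }
  }
  return false;
}

console.log(hasProhibitedControls('êjness@something.com')); // false
console.log(hasProhibitedControls('bad\u0007name@x.com'));  // true
```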

Thanks for summarizing your research, that's super helpful! I'm busy at this exact moment, I'll try to follow up with some questions in about an hour.

Do you think the Unicode v6 stipulation is important for the implementation itself, given js/v8 support for unicode, broadly speaking? Do you think, in pursuit of supporting unicode email addresses, that this module should provide a normalization function which would punycode the domain as appropriate? In the latter case, would the normalization function need to apply any transformations to the local part (apart from stripping out comments and folding whitespace) to make it compatible with any SMTP implementations?

Do you think the Unicode v6 stipulation is important for the implementation itself, given js/v8 support for unicode, broadly speaking

I wouldn't think so, no. Just as a basic "support at minimum these characters" guide. I think you're right that the Node/v8 unicode support would cover the RFC.

From my understanding of the RFC (I most likely misunderstood at least one part :) ), punycoding should not be necessary. The RFC covers internationalization of not only the local/domain pieces of email addresses, but the "SMTP Extension for Internationalized Email Address" as well. I'm not 100% on that assertion, though.

This is the part of the RFC I'm looking at:
Especially "such transformations imply that the originating user or system must have ASCII-only addresses available for all senders and recipients"

Downgrading before and after SMTP Transactions
An important issue with these extensions is how to handle
interactions between systems that support non-ASCII addresses and
legacy systems that expect ASCII. There is, of course, no problem
with ASCII-only systems sending to those that can handle
internationalized forms because the ASCII forms are just a proper
subset. But, when systems that support these extensions send mail,
they MAY include non-ASCII addresses for senders, receivers, or both
and might also provide non-ASCII header information other than
addresses. If the extension is not supported by the first-hop system
(i.e., the SMTP server accessed by the submission server acting as an
SMTP client), message-originating systems SHOULD be prepared to
either send conventional envelopes and message headers or to return
the message to the originating user so the message may be manually
downgraded to the traditional form, possibly using encoded words
[RFC2047] in the message headers. Of course, such transformations
imply that the originating user or system must have ASCII-only
addresses available for all senders and recipients. Mechanisms by
which such addresses may be found or identified are outside the scope

Thanks for that. I brought up punycode because at some point, in order to be effective for a delivery mechanism, the domain name must be looked up in the DNS to find the IP address to connect to. As such, it must follow the requirements of a valid domain name, which I recall as being a subset of ASCII characters. Whether the email address is valid for the envelope is a separate issue, and I'd like to find a more definitive specification than "possibly using encoded words."

Ah, ok, I'm with you. I'll keep digging through the RFC(s), but you may be right that punycoding the domain would be a useful feature.

Yep. I have this dream (predicated on having free time) that I'll transform this library into a general-purpose email address validation, normalization and transformation library. There are parts of nodemailer that seem redundant wrt email address parsing and the subsequent paths of validation and delivery.

I also seem to recall that there was an RFC that proposed specific encodings for unicode compatibility for email addresses, which was maybe deprecated in favor of UTF-8? Not sure.

Oh, cool!

I've only been able to find UTF-8 support in the RFCs I've crawled through so far (6530 being the main reference).

To that end, it might just make sense to pull in a parser generator (or maybe PEG if it's sufficiently powerful) which supports the right Unicode features. Then this entire library would basically be generating the appropriate errors. Not entirely sure that error handling/recovery in existing parser generators is sufficient, though - the existing implementation does its best to output usable error codes with the hope that they might let users know what they did wrong.

Gotcha. Maybe that idea was set aside in favor of the better supported existing UTF-8 encoding.

So how would you feel about splitting this out into two bodies of work? One to allow for simple validation of emails containing non-ASCII Unicode characters, and one for building out the punycode normalization function?

That way the basic email validation is usable sooner rather than later, and the long-term goal of the project doesn't get dropped.

Sounds good to me. Feel free to submit a pull request based on your findings (bonus points if it references specific sections of RFCs, though it sounds like some of the references are simply that there aren't many specific things to reference). I'd like it to pass #18, and not break existing tests. I'd also like you to run the PR against the code style checker for the hapijs org using lab (pretty sure that's just included in npm test).

One final thing to consider: I know UTF-8 has support for multibyte encoding of e.g. null characters. Pretty sure it's a non-issue for this, but just double-check that it won't have unintended impacts.

If you don't get around to this, I'll try to block out some time in the next day or so to take care of it.

Sounds great, thanks @skeggse. I'll see what we can pull out this afternoon/tomorrow AM. :)

RE: comments on #18

@skeggse even with unicode character support, this fails on the DNS lookup because of iana.com vs iana.org.

Hm, gotcha.

This is a good problem to notice. This also indicates that unicode support in the domain names will break checkDNS, as the resolve family of functions is not Unicode-compatible or -aware, which goes back to my question about punycode. It seems we'll need just a little more to avoid breaking existing functionality in unexpected ways.

I think it would be sufficient to add two tests:

  • one test that tests the three email addresses from this PR, but includes checkDNS: false in the options
  • one test (preferably in something like test/tmp.js) that uses rewire to verify that dns.resolveMx is called with the appropriately punycoded string

I'd like the latter test to be separate because I'd prefer to avoid mocking/mimicking separately defined functionality. It's not the worst, but I'd prefer to find a canonical example i18n domain we can rely on, and use that in the normal test process.

EDIT: maybe you could also pull in the other test example email addresses from #19

So, here's the thing... Domain punycode is necessary, and the Node punycode API is being deprecated in favor of the community punycode.js library.

nodejs/node#11218
https://nodejs.org/api/punycode.html

As I move forward with this work I'm going to try out rewire and punycode.js.

@skeggse looks like using rewire for the dns call will require me to refactor const Dns = require('dns'); to let Dns = require('dns');.

I know that is technically against Hapi code standards since Dns is not reassigned inside of lib/index.js, but how do you feel about this?
```
Line 5: prefer-const - 'Dns' is never reassigned. Use 'const' instead.
```

jhnns/rewire#79

Ah, interesting. I'm tempted to refactor the API, and upgrade isemail to a new major rev, without support for checkDNS - I feel that that's almost a separate concern from isemail itself, and this would require users to figure out what exactly they want to check for. For some users it might be sufficient if any record exists for the domain name. In other cases, they want to verify that the server will accept the given localpart as a mailbox. Maybe that's outside the scope of the work you want to do.

I think using let for the moment is fine, but I know hapijs prefers to keep module imports const, generally speaking.

Thanks! I'll use let for now.

Just as a side note, I would support a major rev that drops checkDNS. :D

@Marsup I know joi uses this package - do you think I should release this as 3.0.0 instead of 2.3.0 to avoid compatibility issues? I'm a little concerned that maybe some people depend on this rejecting unicode characters. Thoughts?

I am not worried about that "breaking change". I am worried about the new external dependency.

Hm, granted. Alternative idea: I want to stop supporting the dns resolution check. As a result, we wouldn't need punycode.js, and would be able to simplify the API.

On that note, @WesTyler: I just realized that while we're counting the number of octets it takes to represent the domain, that count doesn't account for the extra four bytes of the xn-- prefix, or the actual encoding scheme used by punycode. The simplest solution there is to just use punycode.js, but maybe we can get it to work without punycode (and then could get rid of the external dependency altogether).

FWIW, this module is the one advised by node core itself, so I'd assume this is a safe dependency.
What do you consider breaking? The fact that it now supports unicode characters?

It is still an external dep that I need to review every time they change something. If there is a way not to require it, that would be preferred.

but maybe we can get it to work without punycode (and then could get rid of the external dependency altogether)

@skeggse so are you proposing removing the normalization, or rolling out an internal normalization implementation...?

No, the normalization is here to stay, and it doesn't handle punycode anyway.

At the moment, it seems that dns checking is superfluous to the goals of this module, and getting rid of it would also mean we wouldn't need the output of punycode. We'd need to figure out how to calculate what a punycoded domain's length would be, though.

Down the line, though, I'd like to have this module provide parsing and normalization (to avoid duplicate processing of email addresses), which would necessitate punycode. So not entirely sure that's the right way to go, but it would solve the problem temporarily.

Oh, my bad, sorry. I forgot that the punycoding and normalization were 2 different steps for 2 different problems XD.

I agree with the judgement that dns checking (and therefore punycoding) is likely superfluous for this module right now. Is it worth opening up a new issue to isolate that effort from the Unicode work in this ticket and the merged PR? If so I can open it up and pull the info from above into it. I can probably carve out some time this afternoon or tomorrow morning to work on it if nobody else gets to it first.

Yeah that sgtm