Support unicode email addresses

Question

Support unicode email addresses

Closed this issue 8 years ago · 34 comments

skeggse commented 9 years ago

See also buildmail:lib/buildmain.js line 890, and RFC 6530.

skeggse commented 8 years ago

0f795c5

👍1
🎉1

Answer 1 · 2016-02-03T17:04:14.000Z

Just processed a few 100,000s of emails and ran into a number of unicode emails that were failing e.g. êjness@something.com

Answer 2 · 2016-02-03T18:08:50.000Z

Ah, good to know this applies to someone. I'll take a look. Do you think you could submit a PR that adds some useful test cases for Unicode?

Answer 3 · 2016-05-06T06:27:23.000Z

what's the unicode character range we are aiming to check for as valid?
charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit

Answer 4 · 2016-05-06T06:30:10.000Z

I haven't had a chance to read and digest the existing RFCs pertaining to unicode in email addresses. My intention is to follow the standard, which I hope will include the full unicode range. Sorry this hasn't moved forward, my school year finally ended, so I hope to have some time to work on this.

Answer 5 · 2017-02-06T18:44:24.000Z

Re: RFC citations and full implementation -
Here are our findings that we believe should inform the work. Please let me know what's missing or not lining up with your expectations.

RFC 6530

Internationalized to utilize the full range of Unicode characters.
Unicode C0 and C1 control characters are prohibited by RFC 5321 and RFC 5198.
References v6 of the Unicode standard.
Permits the use of UTF-8 encoded strings in email addresses, both local parts and domain names.

Answer 6 · 2017-02-06T18:57:18.000Z

Thanks for summarizing your research, that's super helpful! I'm busy at this exact moment, I'll try to follow up with some questions in about an hour.

Answer 7 · 2017-02-06T20:02:42.000Z

Do you think the Unicode v6 stipulation is important for the implementation itself, given js/v8 support for unicode, broadly speaking? Do you think, in pursuit of supporting unicode email addresses, that this module should provide a normalization function which would punycode the domain as appropriate? In the latter case, would the normalization function need to apply any transformations to the local part (apart from stripping out comments and folding whitespace) to make it compatible with any SMTP implementations?

Answer 8 · 2017-02-06T20:08:24.000Z

Do you think the Unicode v6 stipulation is important for the implementation itself, given js/v8 support for unicode, broadly speaking

I wouldn't think so, no. Just as a basic "support at minimum these characters" guide. I think you're right that the Node/v8 unicode support would cover the RFC.

From my understanding of the RFC (I most likely misunderstood at minimum one part :) ), punycoding should not be necessary. The RFC covers internationalization of not only the local/domain piece of email addresses, but "SMTP Extension for Internationalized Email Address" as well. I'm not 100% on that assertion though

Answer 9 · 2017-02-06T20:10:43.000Z

This is the part of the RFC I'm looking at:
Especially "such transformations imply that the originating user or system must have ASCII-only addresses available for all senders and recipients"

Downgrading before and after SMTP Transactions
An important issue with these extensions is how to handle
interactions between systems that support non-ASCII addresses and
legacy systems that expect ASCII. There is, of course, no problem
with ASCII-only systems sending to those that can handle
internationalized forms because the ASCII forms are just a proper
subset. But, when systems that support these extensions send mail,
they MAY include non-ASCII addresses for senders, receivers, or both
and might also provide non-ASCII header information other than
addresses. If the extension is not supported by the first-hop system
(i.e., the SMTP server accessed by the submission server acting as an
SMTP client), message-originating systems SHOULD be prepared to
either send conventional envelopes and message headers or to return
the message to the originating user so the message may be manually
downgraded to the traditional form, possibly using encoded words
[RFC2047] in the message headers. Of course, such transformations
imply that the originating user or system must have ASCII-only
addresses available for all senders and recipients. Mechanisms by
which such addresses may be found or identified are outside the scope

Answer 10 · 2017-02-06T20:23:57.000Z

Thanks for that. I brought up punycode because at some point, in order to be effective for a delivery mechanism, the domain name must be looked up in the DNS to find the IP address to connect to. As such, it must follow the requirements of a valid domain name, which I recall as being a subset of ASCII characters. Whether the email address is valid for the envelope is a separate issue, and I'd like to find a more definitive specification than "possibly using encoded words."

Answer 11 · 2017-02-06T20:25:32.000Z

Ah, ok, I'm with you. I'll keep digging through the RFC(s), but you may be right that punycoding the domain would be a useful feature.

Answer 12 · 2017-02-06T20:28:36.000Z

Yep. I have this dream (predicated on having free time) that I'll transform this library into a general-purpose email address validation, normalization and transformation library. There are parts of nodemailer that seem redundant wrt email address parsing and the subsequent paths of validation and delivery.

I also seem to recall that there was an RFC that proposed specific encodings for unicode compatibility for email addresses, which was maybe deprecated in favor of UTF-8? Not sure.

Answer 13 · 2017-02-06T20:34:07.000Z

Oh, cool!

I've only been able to find UTF-8 support in the RFCs I've crawled through so far (6530 being the main reference).

Answer 14 · 2017-02-06T20:34:34.000Z

To that end, it might just make sense to pull in a parser generator (or maybe PEG if it's sufficiently powerful) which supports the right Unicode features. Then this entire library would basically be generating the appropriate errors. Not entirely sure that error handing/recovery in existing parser generators is sufficient, though - the existing implementation does its best to output usable error codes with the hope that they might let users know what they did wrong.

Answer 15 · 2017-02-06T20:35:44.000Z

Gotcha. Maybe that idea was set aside in favor of the better supported existing UTF-8 encoding.

Answer 16 · 2017-02-06T20:43:55.000Z

So how would you feel about splitting this out into 2 bodies of work? One to allow for simple validation of emails containing non-ascii unicode characters, and one for building out the punycode normalizing function?

That way the basic email validation is useable sooner rather than later, and the long-term goal of the project doesn't get dropped off?

Answer 17 · 2017-02-06T20:51:05.000Z

Sounds good to me. Feel free to submit a pull request based on your findings (bonus points if it references specific sections of RFCs, though it sounds like some of the references are simply that there aren't many specific things to reference). I'd like it to pass #18, and not break existing tests. I'd also like you to run the PR against the code style checker for the hapijs org using lab (pretty sure that's just included in npm test).

One final thing to consider: I know UTF-8 has support for multibyte encoding of e.g. null characters. Pretty sure it's a non-issue for this, but just double-check that it won't have unintended impacts.

If you don't get around to this, I'll try to block out some time in the next day or so to take care of it.

Answer 18 · 2017-02-06T20:52:50.000Z

Sounds great, thanks @skeggse. I'll see what we can pull out this afternoon/tomorrow AM. :)

Answer 19 · 2017-02-06T22:52:09.000Z

RE: comments on #18

@skeggse even with unicode character support, this fails on the DNS lookup because of iana.com vs iana.org.

Hm, gotcha.

This is a good problem to notice. This also indicates that unicode support in the domain names will break checkDNS, as the resolve family of functions are not unicode compatible or aware, which goes back to my question about punycode. It seems we'll need just a little more to avoid breaking existing functionality in unexpected ways.

I think it would be sufficient to add two tests:

one test that tests the three email addresses from this PR, but includes checkDNS: false in the options
one test (preferably in something like test/tmp.js) that uses rewire to verify that dns.resolveMx is called with the appropriately punycoded string

I'd like the latter test to be separate because I'd prefer to avoid mocking/mimicking separately defined functionality. It's not the worse, but I'd prefer to find a canonical example i18n domain we can rely on, and use that in the normal test process.

EDIT: maybe you could also pull in the other test example email addresses from #19

Answer 20 · 2017-02-07T17:22:10.000Z

Relevant: nodejs/node#11218

Answer 21 · 2017-02-07T18:02:02.000Z

So, here's the thing... Domain punycode is necessary, and the Node punycode API is being deprecated in favor of the community punycode.js library.

nodejs/node#11218
https://nodejs.org/api/punycode.html

As I move forward with this work I'm going to try out rewire and punycode.js.

Answer 22 · 2017-02-07T18:21:41.000Z

@skeggse looks like using rewire for the dns call will require me to refactor const Dns = require('dns'); to let Dns = require('dns');.

I know that is technically against Hapi code standards since Dns is not reassigned inside of lib/index.js, but how do you feel about this?
Line 5: prefer-const - 'Dns' is never reassigned. Use 'const' instead.

jhnns/rewire#79

Answer 23 · 2017-02-07T18:25:20.000Z

Ah, interesting. I'm tempted to refactor the API, and upgrade isemail to a new major rev, without support for checkDNS - I feel that that's almost a separate concern from isemail itself, and this would require users to figure out what exactly they want to check for. For some users it might be sufficient if any record exists for the domain name. In other cases, they want to verify that the server will accept the given localpart as a mailbox. Maybe that's outside the scope of the work you want to do.

I think using let for the moment is fine, but I know hapijs prefers to keep module imports const, generally speaking.

Answer 24 · 2017-02-07T18:27:17.000Z

Thanks! I'll use let for now.

Just as a side note, I would support a major rev that drops checkDNS. :D

Answer 25 · 2017-02-15T22:35:34.000Z

@Marsup I know joi uses this package - do you think I should release this as 3.0.0 instead of 2.3.0 to avoid compatibility issues? I'm a little concerned that maybe some people depend on this rejecting unicode characters. Thoughts?

Answer 26 · 2017-02-16T00:55:22.000Z

I am not worried about that "breaking change". I am worried about the new external dependency.

Answer 27 · 2017-02-16T02:51:01.000Z

Hm, granted. Alternative idea: I want to stop supporting the dns resolution check. As a result, we wouldn't need punycode.js, and would be able to simplify the API.

On that note, @WesTyler: I just realized that while we're checking the number of octets it takes to represent the domain as octets, it doesn't account for the extra four bytes, or the actual encoding scheme used by punycode. The simplest solution there is to just use punycode.js, but maybe we can get it to work without punycode (and then could get rid of the external dependency altogether).

Answer 28 · 2017-02-16T08:26:54.000Z

FWIW, this module is the one advised by node core itself, so I'd assume this is a safe dependency.
What do you consider breaking ? The fact that it now supports unicode characters ?

Answer 29 · 2017-02-16T08:45:19.000Z

It is still an external dep that I need to review every time they change something. If there is a way not to require it, that would be preferred.

Answer 30 · 2017-02-16T15:03:10.000Z

but maybe we can get it to work without punycode (and then could get rid of the external dependency altogether)

@skeggse so are you proposing removing the normalization, or rolling out an internal normalization implementation...?

Answer 31 · 2017-02-16T16:51:21.000Z

No, the normalization is here to stay, and it doesn't handle punycode anyway.

At the moment, it seems that dns checking is superfluous to the goals of this module, and getting rid of it would also mean we wouldn't need the output of punycode. We'd need to figure out how to calculate what a punycoded domain's length would be, though.

Down the line, though, I'd like to have this module provide parsing and normalization (to avoid duplicate processing of email addresses), which would necessitate punycode. So not entirely sure that's the right way to go, but it would solve the problem temporarily.

Answer 32 · 2017-02-16T17:09:32.000Z

Oh, my bad, sorry. I forgot that the punycoding and normalization were 2 different steps for 2 different problems XD.

I agree with the judgement that dns checking (and therefore punycoding) is likely superfluous for this module right now. Is it worth opening up a new issue to isolate that effort from the Unicode work in this ticket and the merged PR? If so I can open it up and pull the info from above into it. I can probably carve out some time this afternoon or tomorrow morning to work on it if nobody else gets to it first.

Answer 33 · 2017-02-16T21:25:09.000Z

Yeah that sgtm