michaelherold/pyIsEmail

Why even is_email('##@aa') == True ?

Igotit opened this issue · 2 comments

##@aa is obviously a wrong email address, but this library returns True.

Is there any reference that this is a valid email address?

Good question! I have a practical answer for you, a technical answer as to why this is an email address and some thoughts about a future change below.

Practical Answer

This is a valid email address (see the Technical Answer below for why). Since you were expecting it not to be, it sounds like you're looking for a DNS check against the domain part. You can do that like so:

from pyisemail import is_email

is_email("###@aaa", check_dns=True)
#=> False

That returns your expected result.

Technical Answer

In order to verify whether or not this is an email address, we have to look at the specifications. We can answer your question by following only the chain in RFC5322.

[addr-spec](https://tools.ietf.org/html/rfc5322#section-3.4.1) = local-part "@" domain

So we see that an address is made up of a local-part and a domain separated by an @. We have an @ so we need to look at the local-part and domain to see if they are valid.

local-part

First, we must look at the definition of the local part of the address:

local-part      =   dot-atom / quoted-string / obs-local-part
dot-atom        =   [CFWS] dot-atom-text [CFWS]

# We will ignore the CFWS because there is no whitespace in your example

dot-atom-text   =   1*atext *("." 1*atext)
atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                    "!" / "#" /        ;  characters not including
                    "$" / "%" /        ;  specials.  Used for atoms.
                    "&" / "'" /
                    "*" / "+" /
                    "-" / "/" /
                    "=" / "?" /
                    "^" / "_" /
                    "`" / "{" /
                    "|" / "}" /
                    "~"

# So your "###" parses as 1*atext (three atext characters, no dots).

We see that # is a character that is included in atext, which means that ### is a valid local-part.
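As a rough illustration of the grammar above, the dot-atom-text rule can be transcribed into a regular expression. This is only a sketch, not how pyIsEmail parses addresses: the names ATEXT, DOT_ATOM_TEXT and is_dot_atom_text are made up for this example, and CFWS, quoted strings and the obsolete forms are ignored.

```python
import re

# atext from RFC 5322: ASCII letters, digits and the listed symbols.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"

# dot-atom-text = 1*atext *("." 1*atext)
DOT_ATOM_TEXT = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_dot_atom_text(text: str) -> bool:
    """Hypothetical helper: True if the whole string matches dot-atom-text."""
    return DOT_ATOM_TEXT.fullmatch(text) is not None
```

With this sketch, is_dot_atom_text("###") is True because every character is atext, while a string with a leading dot, two consecutive dots or a space is rejected.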

domain

Next, we must look at the definition of the domain part of the address.

domain          =   dot-atom / domain-literal / obs-domain
dot-atom        =   [CFWS] dot-atom-text [CFWS]

# We will ignore the CFWS because there is no whitespace in your example

dot-atom-text   =   1*atext *("." 1*atext)
atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                    "!" / "#" /        ;  characters not including
                    "$" / "%" /        ;  specials.  Used for atoms.
                    "&" / "'" /
                    "*" / "+" /
                    "-" / "/" /
                    "=" / "?" /
                    "^" / "_" /
                    "`" / "{" /
                    "|" / "}" /
                    "~"

# So your "aaa" parses as 1*atext (three atext characters, no dots).

Notice that nothing requires the domain part to be more than a bare Top-Level Domain (TLD). (Interesting sidenote: this is a minor plot point in Neal Stephenson's novel Cryptonomicon.) While the vast majority of us do not have access to the MX records of a TLD, someone at Google could set an MX record for .goog and have an email address of root@goog, which is a valid email address.
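To make the point concrete, here is a small sketch that checks a whole addr-spec against the dot-atom form only. It is an approximation written for this discussion, not the library's implementation: is_dot_atom_addr_spec and the two constants are hypothetical names, and quoted strings, domain literals, CFWS and the obsolete forms are deliberately left out.

```python
import re

# atext and dot-atom-text as defined in RFC 5322.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
DOT_ATOM_TEXT = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_dot_atom_addr_spec(address: str) -> bool:
    """Hypothetical check: addr-spec = local-part "@" domain, dot-atom only."""
    local_part, at_sign, domain = address.rpartition("@")
    if not at_sign:
        return False  # no "@" at all
    # "@" is not in atext, so a second "@" makes the local-part fail to match.
    return (
        DOT_ATOM_TEXT.fullmatch(local_part) is not None
        and DOT_ATOM_TEXT.fullmatch(domain) is not None
    )
```

Both "###@aaa" and "root@goog" pass this check, because the grammar never asks the domain to contain a dot.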

Possible Improvement

The address is valid under the RFC5321 TLD diagnosis, but that diagnosis is nearly always masked by the DNS check since it's part of the DNS validator that is only run when the check of the DNS records fails.

Perhaps it would be worth splitting those diagnoses out into a separate validator and introducing a "strict mode" that disallows all RFC5321 or DNS diagnoses from being reported. That's commonly what people want or expect, even though it is technically incorrect. I'll think about it.
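Until something like that exists, a caller could approximate part of the idea outside the library. The sketch below is purely hypothetical: is_strictly_valid is not part of pyIsEmail, and it only covers the bare-TLD case by requiring at least two non-empty domain labels on top of whatever validator is passed in.

```python
from typing import Callable


def is_strictly_valid(address: str, validator: Callable[[str], bool]) -> bool:
    """Hypothetical strict check: valid per `validator` AND a dotted domain."""
    if not validator(address):
        return False
    # Everything after the last "@" is treated as the domain.
    domain = address.rpartition("@")[2]
    labels = domain.split(".")
    # Reject single-label domains such as "aaa" or "goog".
    return len(labels) >= 2 and all(labels)
```

Passing the library's own function as the validator, for example is_strictly_valid("###@aaa", is_email), would then reject the address from this issue without a DNS lookup, at the cost of also rejecting technically valid addresses such as root@goog.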

Great explanation! Thank you!