Hostname validator does not accept .name IDNs
adamlundrigan opened this issue · 14 comments
The `Hostname` class keeps an internal list of IDNs (here) and `.name` is not on it.

Example: nødtvedt.name (xn--ndtvedt-q1a.name) is a registered domain (ref: http://whois.domaintools.com/xn--ndtvedt-q1a.name), but `Hostname` does not accept it.
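For reference, a minimal reproduction (a sketch assuming the stock validator with its default options; the namespace may differ depending on the version in use):

```php
<?php

use Zend\Validator\Hostname;

$validator = new Hostname();

// Both the UTF-8 and the Punycode form of the registered domain are rejected:
var_dump($validator->isValid('nødtvedt.name'));        // bool(false)
var_dump($validator->isValid('xn--ndtvedt-q1a.name')); // bool(false)
```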
We need to run the update script.
`name` shows up in the `$validTlds` list (here), which IIRC is where the update script would put it. The problem is that it is not identified as a TLD which supports IDNA (the `validIdns` list). I did a quick search and could not find a definitive list of which characters are supported by `.name`, which would be necessary to add it to the `validIdns` list.
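For context, the two lists in the `Hostname` class look roughly like this (heavily truncated, with the patterns simplified for illustration):

```php
// TLDs the validator accepts at all; maintained by the update script.
protected $validTlds = ['aaa', 'aarp', /* ... */ 'name', /* ... */];

// Per-TLD regexes describing the UTF-8 characters allowed in IDN labels.
// A TLD that is absent here will never match a non-ASCII label.
protected $validIdns = [
    'COM' => [1 => '/^[\x{002d}0-9a-zà-öø-ÿ]{1,63}$/iu'],
    // ... no 'NAME' entry, which is why nødtvedt.name is rejected
];
```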
I've run the update script at this point, and that's on master; can you test, please, @adamlundrigan?
@weierophinney that didn't fix it. The issue isn't that it's missing from the list of valid TLDs, it's that we don't have it listed in the other list in this class (`validIdns`), which identifies the domains supporting IDNA.
If we can take this page on OpenSRS as an authoritative source on the subject, then `.name` supports the same range of non-ASCII characters as `.com`. That page also puts `.cc` and `.tv` in the same "category", and those are also missing from the `validIdns` list in this class.
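If that categorisation holds, the minimal fix is essentially to point the missing TLDs at the pattern set `.com` already uses; a sketch:

```php
// Sketch only: reuse COM's existing IDN pattern set for the TLDs that
// the OpenSRS page places in the same category.
$validIdns['NAME'] = $validIdns['COM'];
$validIdns['CC']   = $validIdns['COM'];
$validIdns['TV']   = $validIdns['COM'];
```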
@adamlundrigan Would you be willing to create a pull request to add these, then, since you seem to know what needs to be done? 😄
Certainly 👍
@weierophinney IANA publishes a "Repository of IDN Practices" which lists all valid characters for each IDN. I can update the `bin/update_hostname_validator.php` script to spider those lists and compile the correct character ranges for each IDN.
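As a rough sketch of what that spidering could look like (the helper name is hypothetical, the URL is one of the published tables, and the real tables vary enough in format that the parsing here is deliberately naive):

```php
// Hypothetical helper for bin/update_hostname_validator.php: download one
// IANA IDN table and pull out the code points it lists as "U+XXXX".
function fetchIdnTableCodePoints($url)
{
    $codePoints = [];
    foreach (explode("\n", file_get_contents($url)) as $line) {
        if (preg_match('/^U\+([0-9A-F]{4,6})/i', trim($line), $matches)) {
            $codePoints[] = hexdec($matches[1]);
        }
    }
    return $codePoints;
}

$points = fetchIdnTableCodePoints(
    'https://www.iana.org/domains/idn-tables/tables/academy_hebr_1.0.txt'
);
```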
If a change to `validIdns` causes a domain which was previously accepted by the Hostname validator to no longer be accepted, is that considered a BC break? IMO it is, so for v2.x releases we'd need to retain the existing `validIdns` list as-is and append the auto-generated character lists to it. For v3 we can drop the manually-curated list.
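Concretely, for 2.x the merge step could look something like this (sketch; `$generatedIdns` is the hypothetical output of the updated script):

```php
// Keep every hand-maintained pattern and only append the auto-generated
// ones, so nothing that validated before stops validating.
foreach ($generatedIdns as $tld => $patterns) {
    $existing = isset($validIdns[$tld]) ? $validIdns[$tld] : [];
    $validIdns[$tld] = array_merge($existing, $patterns);
}
```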
@adamlundrigan That sounds like a great approach!
The script won't be pretty but it will work ;)
How do you think we should store the data? I see two options:
1. Importer converts each "table" (eg: https://www.iana.org/domains/idn-tables/tables/academy_hebr_1.0.txt) into a PHP file which returns an array of accepted code point regexes (similar to how it's done now). The Hostname validator is updated to include the appropriate tables for each domain.
2. Build a single PHP file and for each "table" create a PHP array variable containing the code points, then return from the file a map with the TLD as the key and the array of code point regexes as the value. Many TLDs share the same tables (eg: it looks like every "Donuts Inc" TLD references the ".academy" table files), so we can strip out a lot of duplication by having each of those TLDs reference the same array variable, named according to the table filename.
I'm on the side of (2) since (1) will create a mess of individual PHP files and a ton of include statements inside the Hostname class.
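To make (2) concrete, the generated file could look something like this (table and TLD names are only examples):

```php
// Single generated PHP file: one array per IANA table, then a TLD => table map.
$academy_1_0 = [
    1 => '/^[\x{002d}0-9a-z]{1,63}$/iu',
    // ... remaining code point regexes from the ".academy" table ...
];

return [
    'ACADEMY' => $academy_1_0,
    'BIKE'    => $academy_1_0, // example: another Donuts Inc TLD referencing the same table
    // ...
];
```

The Hostname validator could then include that single file once and use the returned map directly.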
@adamlundrigan I'd opt for 2 as well; fewer includes is better.
Mention me in a comment on the PR once it's ready, or flag me as a reviewer.
I contacted IANA to confirm we could bundle the tables (we can, with attribution) and if they had a machine-readable index of the tables (they don't but would consider adding it). They also warned against using the tables the way we do:
> I would caution you to consider the appropriateness of using the tables for the purpose you are pursuing. The IDN table repository is not 100% comprehensive, insofar as it is not mandatory for all TLDs to register their tables in the resource. Therefore, if you use it as a mechanism to whitelist domains that support IDNs, you will find it likely will miss some cases.
Pulling in the tables that are available is still a big improvement over what we have in place currently so I will proceed.
@weierophinney I've been working on this a bit and it's quite a rabbit hole, so I suggest that the 2.10 release goes ahead without this.
Issues encountered:
- Not all IANA tables are in the same format.
- IANA tables are large (compiling all code points into an array and serializing it results in a ~25MB JSON file). With extra work the downloader could condense the individual code points into ranges to reduce a lot of the bulk (see the sketch after this list).
- Not all information is encoded in the tables in machine-readable format. For example, character location constraints are included as natural-language comments:

```
# Code points: U+002D (HYPHEN-MINUS)
# Reference: RFC 5891 (sec 4.2.3.1)
# Rules: Label must neither start nor end with U+002D. Label must not have U+002D in both third and fourth position.
```
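For the size problem, the range-condensing step could be as simple as this (sketch; assumes a non-empty, sorted list of integer code points):

```php
// Collapse a sorted list of code points into inclusive [start, end] ranges,
// e.g. code points 0x2D, 0x61, 0x62, 0x63 become [[0x2D, 0x2D], [0x61, 0x63]].
function compressCodePoints(array $codePoints)
{
    $ranges = [];
    $start = $end = array_shift($codePoints);
    foreach ($codePoints as $cp) {
        if ($cp === $end + 1) {
            $end = $cp;
            continue;
        }
        $ranges[] = [$start, $end];
        $start = $end = $cp;
    }
    $ranges[] = [$start, $end];
    return $ranges;
}
```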
> I've been working on this a bit and it's quite a rabbit hole, so I suggest that the 2.10 release goes ahead without this.
Okay, will do! I'll shift this to the following bugfix release, if that's okay with you.