Domains with subdomains and dashes fail to parse correctly
mergesort opened this issue · 2 comments
I'm prototyping using DomainParser to check if a URL is valid, and the library seems pretty perfect for fixing all of the edge cases I've run into when using NSDataDetector and regex to validate URLs.
I'm running into a small issue, one that I think is a bug in the library itself. (If I'm holding it wrong please feel free to tell me; I'm not attached to any particular approach, I just want to make sure I'm generating the correct results.)
public static func isValidURL(_ urlString: String) -> Bool {
    // Load the full public suffix list rather than the quick-parsing subset.
    guard let domainParser = try? DomainParser(quickParsing: false) else { return false }
    guard let url = URL(string: self.validatedURL(fromText: urlString)) else { return false }
    guard let host = url.host(percentEncoded: false) else { return false }

    // A host only counts as valid if it has both a registrable domain and a public suffix.
    let parsedDomain = domainParser.parse(host: host)
    return parsedDomain?.domain != nil && parsedDomain?.publicSuffix != nil
}

public static func validatedURL(fromText text: String) -> String {
    // Prepend a scheme when one is missing, so URL(string:) can extract a host.
    if text.lowercased().hasPrefix("http://") || text.lowercased().hasPrefix("https://") {
        return text
    } else {
        return "https://".appending(text)
    }
}
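For context, calling it looks like this (URLValidator is just a hypothetical placeholder name for the type these methods live on):

// URLValidator is a hypothetical name; substitute whatever type holds the methods above.
URLValidator.isValidURL("xkcd.com")                   // true
URLValidator.isValidURL("cdn.prod.atlassian-dev.net") // false, see the examples below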
Most URLs I tried parse perfectly fine, but there appears to be an issue with hosts that contain dashes and also appear as exact matches in the public_suffix_list.dat file.
Note: I've tried all the combinations of DomainParser(quickParsing: false), DomainParser(quickParsing: true), url.host(percentEncoded: false), and url.host(percentEncoded: true) to rule those out as a source of error.
Here are a few examples of what works and what doesn't work.
URL: xkcd.com
domain: xkcd.com
suffix: .com
Works as expected: true

URL: xkcd-1.com
domain: xkcd-1.com
suffix: .com
Works as expected: true

URL: cdn.prod.atlassian-dev.net
domain: nil
suffix: cdn.prod.atlassian-dev.net
Works as expected: false

URL: cdn.prod.atlassian-de.net
domain: atlassian-de.net
suffix: .net
Works as expected: true

URL: s3.us-east-2.amazonaws.com
domain: nil
suffix: s3.us-east-2.amazonaws.com
Works as expected: false

URL: s4.us-east-2.amazonaws.com
domain: amazonaws.com
suffix: .com
Works as expected: true
As you can see in the cdn.prod.atlassian-dev.net and s3.us-east-2.amazonaws.com examples, the entirety of the host is considered to be the suffix while the domain is parsed as nil. Near-matching URLs work correctly, as long as they don't appear in the public_suffix_list.dat file.
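For reference, here's a minimal sketch of how I'm reproducing these results, using the same DomainParser API as my snippet above:

import DomainParser

// Print the parsed domain and suffix for each of the example hosts.
let parser = try! DomainParser(quickParsing: false)
let hosts = ["xkcd.com", "cdn.prod.atlassian-dev.net", "cdn.prod.atlassian-de.net", "s3.us-east-2.amazonaws.com"]
for host in hosts {
    let parsed = parser.parse(host: host)
    print(host, "→ domain:", parsed?.domain ?? "nil", "suffix:", parsed?.publicSuffix ?? "nil")
}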
I hope the breadth of examples above helps explain the issue I'm encountering. I know firsthand this isn't an easy problem to solve, so please let me know if there's any more information I can provide!
Hi, @mergesort!
I think there's a misunderstanding in the interpretation of the Public Suffix List. The PSL contains only valid suffixes. It doesn't contain any domains, since they are derived from the suffixes themselves. By this I mean that the PSL lists, for example, co.uk, but not google.co.uk. There would be no need to do that, since the co.uk rule listed in the PSL means that *anything*.co.uk is a domain, including google.co.uk. But co.uk alone is not a valid domain and is therefore not a valid URL.
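To make that concrete, here's roughly what I'd expect parse(host:) to return for those two cases (a sketch; the comments show the expected values):

let parser = try! DomainParser(quickParsing: false)

// "co.uk" is itself a PSL rule, so there is no registrable domain.
parser.parse(host: "co.uk")?.domain        // nil
// "google.co.uk" sits one label below the "co.uk" rule, so it is a registrable domain.
parser.parse(host: "google.co.uk")?.domain // "google.co.uk"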
Your particular examples, namely cdn.prod.atlassian-dev.net and s3.us-east-2.amazonaws.com, are listed in the PSL, which means that they are valid suffixes, but not valid domains. They are in the same category as co.uk, rather than google.co.uk. A valid domain in this case would be something like domain.cdn.prod.atlassian-dev.net or anything.s3.us-east-2.amazonaws.com (and a hostname could have more subdomain labels to the left of these of course). So the library seems to work correctly in your examples by telling you that those are not valid URLs.
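In code, the distinction looks something like this (same sketch style as above):

let parser = try! DomainParser(quickParsing: false)

// The host is exactly a PSL suffix, so there is no registrable domain.
parser.parse(host: "cdn.prod.atlassian-dev.net")?.domain        // nil
// One extra label to the left makes it a domain under that suffix.
parser.parse(host: "domain.cdn.prod.atlassian-dev.net")?.domain // "domain.cdn.prod.atlassian-dev.net"

This also means your parsedDomain?.domain != nil check is already doing the right thing: it returns false exactly when the host is a bare suffix.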
I hope that this has been of some help and makes sense!
Hey @Bogdanisar. I hadn't considered that these domains could only be used as suffixes, but if that's the case then that makes complete sense!
As long as users shouldn't be able to use those suffixes as bare domains then I don't have any issues and will close out this issue, thank you for the thorough explanation! 🙂