Parsing without PSL still uses PSL for some FQDNs

Question


Closed this issue 2 years ago · 2 comments

jonmartz commented 3 years ago

Hi, we ran into the following issue:

tldextract.extract('kawasaki.jp', include_psl_private_domains=False) produces ExtractResult(subdomain='', domain='kawasaki', suffix='jp'), which is fine.
tldextract.extract('www.kawasaki.jp', include_psl_private_domains=False) produces ExtractResult(subdomain='', domain='', suffix='www.kawasaki.jp'), which would be expected if include_psl_private_domains=True because "*.kawasaki.jp" is a valid suffix according to the public suffix list. But why is the result the same when this parameter is set to False, given that without the "www" prefix the resulting suffix is only "jp"?

Thanks!

Answer 1 · 2022-04-11T18:03:05.000Z

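Re "without PSL":

I think you're reading the include_psl_private_domains parameter as controlling whether this project uses the PSL to parse domains or uses something else. It doesn't: the PSL is used either way. The parameter only controls whether PSL private domains are distinguished from public domains. See this section of the README. The rule that matches here is the wildcard *.kawasaki.jp, which occurs early in the list, in the public (ICANN) section rather than the private domain section, so it isn't affected by the parameter. That is also why kawasaki.jp and www.kawasaki.jp differ: the wildcard needs one label in front of kawasaki.jp, so the bare name only matches the jp rule.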
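To make the matching concrete, here is a minimal sketch of PSL-style suffix matching. It is not tldextract's actual implementation. The rule sets are a tiny hand-picked sample rather than the real list, the names ICANN_RULES, PRIVATE_RULES, Parts, suffix_label_count and extract are all made up for illustration, and it assumes the flag simply adds the private-section rules to the set being matched.

```python
from typing import NamedTuple

# Hand-picked sample rules, NOT the real list. The PSL is split into an
# ICANN (public) section and a private section; this mirrors that split.
ICANN_RULES = {"jp", "com", "*.kawasaki.jp", "!city.kawasaki.jp"}
PRIVATE_RULES = {"blogspot.com"}  # illustrative private-section entry


class Parts(NamedTuple):
    subdomain: str
    domain: str
    suffix: str


def suffix_label_count(labels, rules):
    """Return how many trailing labels make up the public suffix."""
    best = 1  # implicit "*" rule: the last label is always a suffix
    for i in range(len(labels)):  # longest candidate first
        candidate = labels[i:]
        name = ".".join(candidate)
        if "!" + name in rules:
            # Exception rule: the suffix is the rule minus its leftmost label.
            return len(candidate) - 1
        wildcard = ".".join(["*"] + candidate[1:])
        if name in rules or wildcard in rules:
            best = max(best, len(candidate))
    return best


def extract(host, include_private=False):
    """Hypothetical stand-in for the library call, using the sample rules."""
    rules = ICANN_RULES | PRIVATE_RULES if include_private else ICANN_RULES
    labels = host.lower().split(".")
    n = suffix_label_count(labels, rules)
    rest = labels[:-n]
    return Parts(
        subdomain=".".join(rest[:-1]),
        domain=rest[-1] if rest else "",
        suffix=".".join(labels[-n:]),
    )


# The wildcard rule is in the ICANN set, so the flag makes no difference here.
print(extract("kawasaki.jp"))
print(extract("www.kawasaki.jp"))
print(extract("www.kawasaki.jp", include_private=True))
# The flag only matters for names that match a private-section rule.
print(extract("foo.blogspot.com"))
print(extract("foo.blogspot.com", include_private=True))
```

In this sketch the flag changes the result for foo.blogspot.com but not for www.kawasaki.jp, because only the former matches a rule from the private set.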

Answer 2 · 2022-04-12T10:54:51.000Z

Thank you very much for the quick response, which resolves our issue.