bad parsing

Question

Closed this issue 3 years ago · 3 comments

erpatrik commented 3 years ago

Hi,
I found 2 issues while parsing domains.

tldextract.extract("hokkaido.jp")
ExtractResult(subdomain='', domain='', suffix='hokkaido.jp')

tldextract.extract("ketrzyn.pl")
ExtractResult(subdomain='', domain='', suffix='ketrzyn.pl')

Answer 1 · 2021-11-16T23:54:37.000Z

Having the same issue with ne.jp

Not sure if this is relevant, but ne.jp is actually incorrect; it should be www.ne.jp. I'm working with a legacy system that strips www from URLs. When run on www.ne.jp it works, but that causes other bugs for me.

Answer 2 · 2021-11-17T00:08:32.000Z

These are all suffixes on the Public Suffix List. That's why they're parsed like this.

https://publicsuffix.org/list/public_suffix_list.dat

Answer 3 · 2021-12-01T07:36:50.000Z

@erpatrik, @ShmuelTreiger is correct: the domains you're testing are in the public suffix list, so this library is correctly returning them as suffixes.
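
To illustrate why the domain comes back empty, here is a minimal sketch of longest-suffix matching. It is not tldextract's actual implementation: `SUFFIX_RULES`, `Parts`, and `split_host` are hypothetical names, the rule set is a tiny hand-picked excerpt of the list, and wildcard and exception rules are ignored.

```python
from typing import NamedTuple

# Hypothetical excerpt of the public suffix list; the real list is far larger.
SUFFIX_RULES = {"jp", "hokkaido.jp", "ne.jp", "pl", "ketrzyn.pl"}


class Parts(NamedTuple):
    subdomain: str
    domain: str
    suffix: str


def split_host(host: str, rules=SUFFIX_RULES) -> Parts:
    """Split a hostname using the longest matching suffix rule."""
    labels = host.lower().strip(".").split(".")
    # Candidates run from the whole host down to the last label,
    # so the first hit is the longest matching suffix.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            # The label just left of the suffix is the domain, if one exists.
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return Parts(subdomain, domain, candidate)
    # No rule matched: treat the last label as the domain (an assumption
    # made for this sketch).
    return Parts(".".join(labels[:-1]), labels[-1], "")


print(split_host("hokkaido.jp"))  # whole host is a suffix, so domain is ''
print(split_host("www.ne.jp"))    # 'www' becomes the domain under 'ne.jp'
```

When the entire input matches a rule, nothing is left over to serve as the domain, which is the result reported above. It also shows why www.ne.jp parses "successfully": www is simply taken as the registered domain under the ne.jp suffix.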