IPv6 addresses are not handled
Closed this issue · 6 comments
For URLs using IPv4 addresses, the host address gets extracted correctly using .domain
but .fqdn
gives the empty string:
>>> tldextract.extract("https://127.0.0.1:1234/foobar").domain
'127.0.0.1'
>>> tldextract.extract("https://127.0.0.1:1234/foobar").fqdn
''
For URLs using IPv6 addresses, neither method extracts the host address correctly:
>>> tldextract.extract("https://[::1]:1234/foobar").domain
'['
>>> tldextract.extract("https://[::1]:1234/foobar").fqdn
''
>>> tldextract.extract("https://[FEC0:0000:0000:0000:0000:0000:0000:0001]:1234/foobar").domain
'[FEC0'
>>> tldextract.extract("https://[FEC0:0000:0000:0000:0000:0000:0000:0001]:1234/foobar").fqdn
''
This was tested using tldextract
version 3.2.1 on Python 3.9.12
Agreed that those IPv6 outputs are garbage, like returning '['
or '[FEC0'
for the domain. IPv6 has come up a couple times before. This library supports IPv4. It seems not too much extra work to support IPv6. I'm for it. 👍
Re: IPv4, your quoted output is intended.
Think of domain
as the body of the IP address's ExtractResult
; an IP has no subdomain or suffix.
fqdn
is a computed property, "if there is a proper domain/suffix," or else it's the falsy, empty string you observed. If the extraction is an IP, do you think it makes sense for the computed property to contain some other (fully qualified domain) "name" for a number?
Ah, I was just naively assuming that "domain" < "fully-qualified domain". My bad for taking a random code snippet from SE and not really reading the docstring 😅.
I did read the README, so maybe adding the examples there would have helped in my case…
I guess I wouldn't change how fqdn
behaves in the case of IPs either. It's probably best to think of localhost
as interchangeable with IPv4 127.0.0.1
or IPv6 [::1]
, so the empty string would be right here too.
What I needed in my use case was a way to get just the hostname (without user credentials or port number) that would work for all URLs, and fqdn
kinda looked like what I wanted. Currently I'm doing post-processing on the result of urllib.parse.urlsplit
instead.
Reading the code, netloc
seems like what I was looking for, but I'm not sure if exposing it as a property is really in the mission statement for this library.
I hear you on the README examples. 👍
What I needed in my use case was a way to get just the hostname (without user credentials or port number) that would work for all URLs … Currently I'm doing post-processing on the result of
urllib.parse.urlsplit
instead.
Yup, urllib.parse.urlsplit(some_url).netloc
is what you're looking for in the standard library. In this library, domain
is equivalent for IPv4 (or, more generally, rejoining the entire tuple). Just need a fix for IPv6 to be treated the same.