Get the port from the provided URL via the extract function
Hi!
I was wondering if it is possible to get the port from a URL when the extract
function (or another function) is invoked. I guess it is not, since I didn't see it in the documentation, and after digging a little in the code I didn't find anything related. I'm using this library to obtain URLs from a large list and then crawl those URLs, so I need the port whenever one is defined. If it is not currently possible to obtain the port, is it planned to implement this functionality?
>>> tldextract.extract('http://127.0.0.1:8080/deployed/')
ExtractResult(subdomain='', domain='127.0.0.1', suffix='', port='8080')
Thank you!
I took a stab at this in #273. I'm not sold on the solution as-is; feel free to chime in there. In the meantime, I suggest parsing the port with the standard library. Example:
import urllib.parse

import tldextract

split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
split_suffix = tldextract.extract(split_url.netloc)
url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"
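One caveat with that snippet: `urlsplit(...).port` is `None` when the URL carries no explicit port, so the f-string above would produce a URL ending in `:None`. A small stdlib-only sketch that guards against that (the `port_suffix` helper is my own illustration, not part of tldextract):

```python
import urllib.parse

def port_suffix(url: str) -> str:
    """Return ':<port>' if the URL has an explicit port, else ''."""
    port = urllib.parse.urlsplit(url).port  # int, or None when absent
    return f":{port}" if port is not None else ""
```

For example, `port_suffix("https://foo.bar.com:8080")` yields `":8080"`, while `port_suffix("https://foo.bar.com")` yields `""`, so the suffix can be appended to the registered domain unconditionally.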
As of #274, the above workaround can be tweaked slightly to avoid parsing the string twice:
split_url = urllib.parse.urlsplit("https://foo.bar.com:8080")
- split_suffix = tldextract.extract(split_url.netloc)
+ split_suffix = tldextract.extract_urllib(split_url)
url_to_crawl = f"{split_url.scheme}://{split_suffix.registered_domain}:{split_url.port}"
After thinking about it: this library is focused on domain names, not on every component of a URL, and I defer general URL parsing to Python's standard library. I hope the workaround in the previous comment helps!