Some functions to parse and normalize URLs.
The main focus of this library is to make it possible to work on all segments of a URL. Thus a core feature (which the stdlib does not provide) is splitting a domain name correctly by using the Public Suffix List (see below).
>>> urltools.normalize("Http://exAMPLE.com./foo")
'http://example.com/foo'
Rules that are applied to normalize a URL (see the example after this list):
- lowercase scheme
- lowercase host (also works with IDNs)
- remove default port
- remove ':' without port
- remove DNS root label
- unquote path, query, fragment
- collapse path (remove '//', '/./', '/../')
- sort query params and remove params without value
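For instance, a URL that triggers several of these rules at once should normalize roughly as follows (a sketch assuming the rules listed above; exact output may vary between versions):

>>> urltools.normalize("http://Example.com:80/a/b/../c//d?z=3&y=&x=1")
'http://example.com/a/c/d?x=1&z=3'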
`normalize` uses the splitting and normalization functions described below. The hostname is not lowercased by `normalize_host`; that happens earlier, in the `split_host` step, to make splitting of malformed netlocs easier.
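For example, `normalize_host` leaves the case of a plain ASCII hostname untouched (an illustrative sketch following the statement above):

>>> normalize_host('EXAMPLE.de')
'EXAMPLE.de'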
The result of `parse` and `extract` is a `URL` named tuple that contains `scheme`, `username`, `password`, `subdomain`, `domain`, `tld`, `port`, `path`, `query`, `fragment` and the original `url` itself.
>>> urltools.parse("http://example.co.uk/foo/bar?x=1#abc")
URL(scheme='http', username='', password='', subdomain='', domain='example',
tld='co.uk', port='', path='/foo/bar', query='x=1', fragment='abc',
url='http://example.co.uk/foo/bar?x=1#abc')
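Since the result is a named tuple, its segments can also be read as attributes (a minimal sketch based on the fields listed above):

>>> u = urltools.parse("http://example.co.uk/foo/bar?x=1#abc")
>>> u.domain, u.tld
('example', 'co.uk')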
If the scheme is missing, `parse` interprets the URL as relative.
>>> urltools.parse("www.example.co.uk/abc")
URL(scheme='', username='', password='', subdomain='', domain='', tld='',
port='', path='www.example.co.uk/abc', query='', fragment='',
url='www.example.co.uk/abc')
`extract` does not care about relative URLs and always tries to extract as much information as possible.
>>> urltools.extract("www.example.co.uk/abc")
URL(scheme='', username='', password='', subdomain='www', domain='example',
tld='co.uk', port='', path='/abc', query='', fragment='',
url='www.example.co.uk/abc')
Besides the main functions described above, `urltools` has some more functions to manipulate segments of a URL or create new URLs (a short example of how they compose follows the list).
- `construct` a new URL from parts:

  >>> construct(URL('http', '', '', '', 'example', 'com', '', '/abc',
  ...     'x=1', 'foo', None))
  'http://example.com/abc?x=1#foo'

- `compare` two URLs to check if they are the same:

  >>> compare("http://examPLe.com:80/abc?x=&b=1",
  ...         "http://eXAmple.com/abc?b=1")
  True

- `encode` a URL (IDNA, see RFC 3490):

  >>> urltools.encode("http://müller.de")
  'http://xn--mller-kva.de/'

- `normalize_host` decodes IDNA-encoded segments of a DNS name:

  >>> normalize_host('xn--mller-kva.de')
  u'müller.de'
  >>> normalize_host('xn--e1afmkfd.xn--p1ai')
  u'пример.рф'

- `normalize_path`:

  >>> normalize_path("/a/b/../../c")
  '/c'

- `normalize_query`:

  >>> normalize_query("x=1&y=&z=3")
  'x=1&z=3'

- `normalize_fragment` unquotes fragments except for the characters `+`, `#` and space

- `unquote` a string. Optionally, a list of characters that should not be unquoted can be specified:

  >>> unquote('foo%23bar')
  'foo#bar'
  >>> unquote('foo%23bar', ['#'])
  'foo%23bar'

- `split` is basically the same as `urlparse.urlparse` in Python 2.7 or `urllib.parse.urlparse` in Python 3.4. In Python 2.7 it handles some malformed URLs better than `urlparse`. Differences to `urlparse` in Python 3.4 were not analyzed.

  >>> split("http://www.example.com/abc?x=1&y=2#foo")
  SplitResult(scheme='http', netloc='www.example.com', path='/abc',
  query='x=1&y=2', fragment='foo')

- `split_netloc` splits a network location (netloc) into username, password, host and port:

  >>> split_netloc("foo:bar@www.example.com:8080")
  ('foo', 'bar', 'www.example.com', '8080')

- `split_host` uses the Public Suffix List to split a domain name correctly:

  >>> split_host("www.example.ac.at")
  ('www', 'example', 'ac.at')
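Taken together, these lower-level helpers can be chained to take a URL apart step by step. The following sketch is illustrative only and not necessarily how `parse` or `extract` work internally:

>>> parts = split("http://foo:bar@www.example.ac.at:8080/abc")
>>> split_netloc(parts.netloc)
('foo', 'bar', 'www.example.ac.at', '8080')
>>> split_host('www.example.ac.at')
('www', 'example', 'ac.at')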
`urltools` uses the Public Suffix List (PSL) to split domain names correctly. E.g. the TLD of `example.co.uk` would be `.co.uk` and not `.uk`. It is not possible to decide "how big" the TLD is without a lookup in this list.
A local copy of the PSL is recommended; otherwise it is downloaded with each import of `urltools`. The path of the local copy has to be defined in the environment variable `PUBLIC_SUFFIX_LIST`:
export PUBLIC_SUFFIX_LIST=/path/to/effective_tld_names.dat
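The same can be done from within Python before the first import, assuming the variable is read at import time as described above (the path is a placeholder):

import os
os.environ["PUBLIC_SUFFIX_LIST"] = "/path/to/effective_tld_names.dat"
import urltools  # the PSL is now read from the local copy on import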
For more information about how the PSL works see http://publicsuffix.org/
You can install `urltools` from the Python Package Index (PyPI):
pip install urltools
... or get the latest version directly from GitHub:
pip install -e git://github.com/rbaier/python-urltools.git#egg=urltools
The second option is not recommended because some features might be in an experimental state.
There is (or should be) a git tag for each version that was released on PyPI.
tox and pytest are used for testing. Simply install tox and run it:
pip install tox
tox