fasttld

fasttld is a high-performance top-level domain (TLD) extraction module based on a compressed trie data structure implemented with the built-in Python dict().

Background

The goal of fasttld is to extract top-level domains (TLDs) from URLs efficiently. In other words, it extracts com from URLs like www.google.com or https://maps.google.com:8080/a/long/path/?query=42.

Running something like ".".join(domain.split('.')[1::]) is not a viable solution. For example, maps.baidu.com.cn would give the wrong result baidu.com.cn instead of com.cn.
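The failure is easy to reproduce. The snippet below is a small illustration only; the helper name naive_suffix is hypothetical and is not part of fasttld.

```python
def naive_suffix(hostname: str) -> str:
    """Drop the first label and treat everything after it as the suffix."""
    return ".".join(hostname.split(".")[1:])


# Works by coincidence when the hostname has exactly two labels.
print(naive_suffix("google.com"))         # com
# Fails for multi-label suffixes: the expected suffix is com.cn.
print(naive_suffix("maps.baidu.com.cn"))  # baidu.com.cn
```

The split has no way of knowing how many trailing labels belong to the suffix, which is why a lookup against a suffix list is needed.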

The fasttld module solves this problem by using the regularly-updated Mozilla Public Suffix List and the trie data structure to efficiently extract subdomains, hostnames, and TLDs from URLs.
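To make the idea concrete, here is a minimal sketch of a dict-based suffix trie. It is not fasttld's actual implementation, which this README does not show. The suffix list is a tiny hand-written sample, the names build_trie and split_host are hypothetical, and the wildcard and exception rules of the real Public Suffix List are ignored.

```python
_END = object()  # sentinel key marking "a complete suffix ends at this node"


def build_trie(suffixes):
    """Build a nested-dict trie keyed by labels, from the rightmost label inward."""
    trie = {}
    for suffix in suffixes:
        node = trie
        for label in reversed(suffix.split(".")):
            node = node.setdefault(label, {})
        node[_END] = True
    return trie


def split_host(hostname, trie):
    """Return (subdomain, domain, suffix) using the longest suffix found in the trie."""
    labels = hostname.split(".")
    node = trie
    suffix_len = 0
    for depth, label in enumerate(reversed(labels), start=1):
        if label not in node:
            break
        node = node[label]
        if _END in node:
            suffix_len = depth  # remember the longest complete match so far
    suffix = ".".join(labels[len(labels) - suffix_len:])
    rest = labels[:len(labels) - suffix_len]
    domain = rest[-1] if rest else ""
    subdomain = ".".join(rest[:-1])
    return subdomain, domain, suffix


# Hand-written sample; the real list is far larger.
SAMPLE_SUFFIXES = ["com", "cn", "com.cn", "uk", "ac.uk"]
trie = build_trie(SAMPLE_SUFFIXES)

print(split_host("maps.baidu.com.cn", trie))          # ('maps', 'baidu', 'com.cn')
print(split_host("a.long.subdomain.ox.ac.uk", trie))  # ('a.long.subdomain', 'ox', 'ac.uk')
```

Walking the labels from right to left means each lookup costs one dict access per label, regardless of how many suffixes the list contains.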

fasttld also supports extraction of private domains listed in the Mozilla Public Suffix List like 'blogspot.co.uk' and 'sinaapp.com'.

Installation

You can install fasttld from PyPI.

pip install fasttld

Or build it from source:

git clone https://github.com/jophy/fasttld.git && cd fasttld
python setup.py install

Usage

>>> from fasttld import FastTLDExtract
>>> t = FastTLDExtract()
>>> res = t.extract("https://some-user@a.long.subdomain.ox.ac.uk:5000/a/b/c/d/e/f/g/h/i?id=42")
>>> scheme, userinfo, subdomain, domain, suffix, port, path, domain_name = res
>>> scheme, userinfo, subdomain, domain, suffix, port, path, domain_name
('https://', 'some-user', 'a.long.subdomain', 'ox', 'ac.uk', '5000', 'a/b/c/d/e/f/g/h/i?id=42', 'ox.ac.uk')

extract() returns a tuple (scheme, userinfo, subdomain, domain, suffix, port, path, domain_name).
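If positional unpacking of eight fields feels error-prone, the tuple can be wrapped in a namedtuple on the caller's side. This is only a convenience sketch: the name FastTLDResult is hypothetical and not provided by fasttld, and the tuple below is copied from the example output above so that the snippet runs without fasttld installed.

```python
from collections import namedtuple

# Hypothetical wrapper type; the field order follows the documented tuple.
FastTLDResult = namedtuple(
    "FastTLDResult",
    ["scheme", "userinfo", "subdomain", "domain", "suffix", "port", "path", "domain_name"],
)

# Result tuple copied from the usage example above.
res = ('https://', 'some-user', 'a.long.subdomain', 'ox', 'ac.uk', '5000',
       'a/b/c/d/e/f/g/h/i?id=42', 'ox.ac.uk')

named = FastTLDResult(*res)
print(named.suffix)       # ac.uk
print(named.domain_name)  # ox.ac.uk
```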

Update the Mozilla Public Suffix List local copy

Whenever fasttld is called, it automatically updates the local copy of the Mozilla Public Suffix List if that copy is more than 3 days old. You can also run the update manually in either of the following ways.

>>> import fasttld
>>> fasttld.update()

>>> from fasttld import FastTLDExtract
>>> FastTLDExtract().update()

Automatic updating can be disabled by setting the environment variable FASTTLD_NO_AUTO_UPDATE to 1.
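The variable can be set in the shell or from Python itself. The sketch below sets it in-process. It assumes that fasttld reads the variable no earlier than import time, which this README does not state, so the assignment is placed before the import to be safe.

```python
import os

# Assumption: fasttld checks this variable at import or instantiation time,
# so it is set before either happens.
os.environ["FASTTLD_NO_AUTO_UPDATE"] = "1"

# With the flag in place, import and use fasttld as usual:
# from fasttld import FastTLDExtract
# t = FastTLDExtract()
```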

Specify Mozilla Public Suffix List file

You can also specify your own public suffix list file.

>>> from fasttld import FastTLDExtract
>>> FastTLDExtract(file_path='/path/to/psl/file').extract('domain', subdomain=False)

Disable subdomain output

If you do not need to extract subdomains, you can disable subdomain output with subdomain=False.

>>> from fasttld import FastTLDExtract
>>> FastTLDExtract().extract('domain', subdomain=False) # set subdomain=False

Optional: Exclude private domains

According to the Mozilla.org wiki, the Mozilla Public Suffix List contains private domains like blogspot.co.uk and sinaapp.com because some registered domain owners wish to delegate subdomains to mutually-untrusting parties, and find that being added to the PSL gives their solution more favourable security properties.

By default, fasttld treats private domains as TLDs (i.e. exclude_private_suffix=False).

>>> from fasttld import FastTLDExtract
>>> FastTLDExtract(exclude_private_suffix=False).extract('news.blogspot.co.uk')  # blogspot.co.uk is treated as a TLD
('', '', '', 'news', 'blogspot.co.uk', '', '', 'news.blogspot.co.uk')
>>> FastTLDExtract().extract('news.blogspot.co.uk')  # default behaviour, same output as above
('', '', '', 'news', 'blogspot.co.uk', '', '', 'news.blogspot.co.uk')

You can instruct fasttld to exclude private domains by setting exclude_private_suffix=True.

>>> from fasttld import FastTLDExtract
>>> FastTLDExtract(exclude_private_suffix=True).extract('news.blogspot.co.uk')  # co.uk is now recognised as the TLD instead of blogspot.co.uk
('', '', 'news', 'blogspot', 'co.uk', '', '', 'blogspot.co.uk')
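The Public Suffix List file separates its two parts with marker comments such as "// ===BEGIN PRIVATE DOMAINS===". One way a parser could honour an exclude-private option is sketched below. This is an illustration under assumptions, not fasttld's own loader: the function name load_suffixes and its exclude_private parameter are hypothetical, and the input is a hand-written excerpt rather than the real file.

```python
# Hand-written excerpt in the Public Suffix List layout.
SAMPLE_PSL = """\
// ===BEGIN ICANN DOMAINS===
uk
co.uk
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
blogspot.co.uk
sinaapp.com
// ===END PRIVATE DOMAINS===
"""


def load_suffixes(psl_text, exclude_private=False):
    """Collect suffix rules, optionally skipping the private section."""
    suffixes = []
    in_private = False
    for line in psl_text.splitlines():
        line = line.strip()
        # Track whether we are inside the private section.
        if line.startswith("// ===BEGIN PRIVATE DOMAINS==="):
            in_private = True
        elif line.startswith("// ===END PRIVATE DOMAINS==="):
            in_private = False
        # Skip blank lines and comments (including the markers themselves).
        if not line or line.startswith("//"):
            continue
        if in_private and exclude_private:
            continue
        suffixes.append(line)
    return suffixes


print(load_suffixes(SAMPLE_PSL))
# ['uk', 'co.uk', 'blogspot.co.uk', 'sinaapp.com']
print(load_suffixes(SAMPLE_PSL, exclude_private=True))
# ['uk', 'co.uk']
```

With the private rules left out, the longest matching suffix for news.blogspot.co.uk becomes co.uk, which matches the output shown above.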

Speed Comparison

Similar modules include tldextract and tld.

Test conditions

Initialize the module class once, then call its extract function ten million times. Measure the time taken.
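The benchmark script itself is not reproduced here. A harness along the following lines would match the description; it is a sketch only, time_calls and dummy_extract are hypothetical names, and the stand-in extractor exists purely so that the snippet runs without any of the compared modules installed.

```python
import time


def time_calls(func, arg, repeat=10_000_000):
    """Call func(arg) `repeat` times and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(repeat):
        func(arg)
    return time.perf_counter() - start


def dummy_extract(url):
    """Stand-in extractor; swap in an already-initialized extract method to benchmark."""
    return url.rsplit(".", 1)[-1]


# A reduced repeat count keeps this demonstration quick.
elapsed = time_calls(dummy_extract, "www.baidu.com.cn", repeat=100_000)
print(f"{elapsed:.4f}s for 100,000 calls")
```

To approximate the test conditions, create the extractor object once outside the timed loop and pass its bound extract method as func with repeat left at ten million.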

Test environment

Python 3.9.12, AMD Ryzen 7 5800X 3.8 GHz 8 cores 16 threads, 48GB RAM

Test results

module\case	`jophy.com`	`www.baidu.com.cn`	`jo.noexist`	`https://maps.google.com.ua/a/long/path?query=42`	`1.1.1.1`	`https://192.168.55.1`
fasttld	7.60s	9.90s	5.28s	5.67s	5.06s	5.30s
tldextract	22.96s	29.32s	25.06s	31.69s	33.89s	35.15s
tld	26.75s	29.00s	23.01s	27.55s	22.79s	22.55s

Excluding subdomains (i.e. subdomain=False)

module\case	`jophy.com`	`www.baidu.com.cn`	`jo.noexist`	`https://maps.google.com.ua/a/long/path?query=42`	`1.1.1.1`	`https://192.168.55.1`
fasttld	7.55s	8.98s	5.20s	5.52s	5.13s	5.25s

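On average, fasttld is 4 to 5 times faster than the other modules in these tests. The speedup varies by input, from about 3x for jophy.com and www.baidu.com.cn to nearly 7x over tldextract for 1.1.1.1. It retains its performance advantage even when parsing long URLs like https://maps.google.com.ua/a/long/path?query=42.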

Acknowledgements

Some code is borrowed from the tldextract module.