john-kurkowski/tldextract

Performance Issues after Upgrade from 3.4.1 to 3.4.4

Closed this issue · 2 comments

Hi,

I am very sure that this is 100% our fault, but I just wanted to also make sure, that this is not also effecting people who are actually using this package in the right way.

We had some performance issues and after a long investigation involving pip freeze we realized that it came from an TLDExtract upgrade from 3.4.1 to 3.4.4.

We were using tldextract like this in a function:

def get_domain(url: str):
    extractor = tldextract.TLDExtract(
        cache_dir=None, suffix_list_urls=[], fallback_to_snapshot=True
    )
    ext = extractor(url)
    return "{}.{}".format(ext.domain, ext.suffix)

We have quite a few calls to this function per second and for some reason this slowed down significantly after a recent upgrade from 3.4.1 to 3.4.4.

I tried to do some profiling:

I left this running for roughly the same amount of time, but it is unlikely that the method above was called the same amount of time, because this was slowing down the whole program.

3.4.1 (no issues):
image

3.4.4 (issue):
image

As a fix I just took the callable constructor out of that function and now the performance is way better then before even:

extractor = tldextract.TLDExtract(
        cache_dir=None, suffix_list_urls=[], fallback_to_snapshot=True
    )
def get_domain(url: str):
    ext = extractor(url)
    return "{}.{}".format(ext.domain, ext.suffix)

I realize this was probably just bad practice from our side, but as said before I just wanted to make sure there is really nothing of interest here. Feel free to close this.

Hi,

The slowdown is likely due to the constructor initializing suffix Tries. This was implemented in v3.4.3 to speed up the extraction.

        self.tlds_incl_private = frozenset(public_tlds + private_tlds + extra_tlds)
        self.tlds_excl_private = frozenset(public_tlds + extra_tlds)
        self.tlds_incl_private_trie = Trie.create(self.tlds_incl_private)
        self.tlds_excl_private_trie = Trie.create(self.tlds_excl_private)

Your solution is correct since you are always using the same fallback snapshot, hence calling the constructor once and reusing the extractor is recommended practice.


A sidenote on the following line

return "{}.{}".format(ext.domain, ext.suffix)

get_domain() would return nonexistenttld. for example.nonexistenttld (example parsed as subdomain, nonexistenttld parsed as domain)
and .co.uk for co.uk (blank string passed as domain)

Would suggest using ext.registered_domain instead.

OK thank you! That is all very good to know. And thank you also for your feedback!