Performance Issues after Upgrade from 3.4.1 to 3.4.4
Closed this issue · 2 comments
Hi,
I am very sure that this is 100% our fault, but I just wanted to also make sure, that this is not also effecting people who are actually using this package in the right way.
We had some performance issues and after a long investigation involving pip freeze we realized that it came from an TLDExtract upgrade from 3.4.1 to 3.4.4.
We were using tldextract like this in a function:
def get_domain(url: str):
extractor = tldextract.TLDExtract(
cache_dir=None, suffix_list_urls=[], fallback_to_snapshot=True
)
ext = extractor(url)
return "{}.{}".format(ext.domain, ext.suffix)
We have quite a few calls to this function per second and for some reason this slowed down significantly after a recent upgrade from 3.4.1 to 3.4.4.
I tried to do some profiling:
I left this running for roughly the same amount of time, but it is unlikely that the method above was called the same amount of time, because this was slowing down the whole program.
As a fix I just took the callable constructor out of that function and now the performance is way better then before even:
extractor = tldextract.TLDExtract(
cache_dir=None, suffix_list_urls=[], fallback_to_snapshot=True
)
def get_domain(url: str):
ext = extractor(url)
return "{}.{}".format(ext.domain, ext.suffix)
I realize this was probably just bad practice from our side, but as said before I just wanted to make sure there is really nothing of interest here. Feel free to close this.
Hi,
The slowdown is likely due to the constructor initializing suffix Tries. This was implemented in v3.4.3 to speed up the extraction.
self.tlds_incl_private = frozenset(public_tlds + private_tlds + extra_tlds)
self.tlds_excl_private = frozenset(public_tlds + extra_tlds)
self.tlds_incl_private_trie = Trie.create(self.tlds_incl_private)
self.tlds_excl_private_trie = Trie.create(self.tlds_excl_private)
Your solution is correct since you are always using the same fallback snapshot, hence calling the constructor once and reusing the extractor is recommended practice.
A sidenote on the following line
return "{}.{}".format(ext.domain, ext.suffix)
get_domain() would return nonexistenttld.
for example.nonexistenttld
(example
parsed as subdomain, nonexistenttld
parsed as domain)
and .co.uk
for co.uk
(blank string passed as domain)
Would suggest using ext.registered_domain instead.
OK thank you! That is all very good to know. And thank you also for your feedback!