john-kurkowski/tldextract

problems recognizing some TLD suffixes that seem to be present in the data files?


Hi, I noticed that a few suffixes are not recognized as TLD suffixes, even though they appear both in the .tld_set_snapshot distributed with v3.1.0 and in the current https://publicsuffix.org/list/public_suffix_list.dat file that gets downloaded.

The two I noticed specifically were co.ca and co.cz. This test snippet shows the behavior for me (in a pyenv-created virtualenv based on Python 3.8.9):

import os
import sys
import logging

# init logging in debug mode to hopefully get a little info out of tldextract:
logging.basicConfig(
    stream=sys.stdout,
    level=logging.DEBUG,
    format="%(asctime)s:%(levelname)s:%(name)s[%(lineno)s]:%(message)s"
)

import tldextract

custom_cache_extract = tldextract.TLDExtract(cache_dir=os.path.join(os.getcwd(), '.tld_set'))
# same behavior for cached vs not:
no_cache_extract = tldextract.TLDExtract(cache_dir=False)

testdata = [
    "co.cz",
    "co.uk",
    "co.ca",
    "co.za"
]

with open(f'{os.path.dirname(tldextract.__file__)}{os.path.sep}.tld_set_snapshot','r') as fd:
    tld_snapshot = [line.rstrip() for line in fd]

for td in testdata:
    tname = f'foo.{td}'
    dparts = custom_cache_extract(tname)
    dparts_nc = no_cache_extract(tname)
    print(f'dparts: {dparts}')
    print(f'dparts_nc: {dparts_nc}')
    print(f'              => is {td} in tld_set_snapshot: {td in tld_snapshot}')
    print(f' (cached)     => is {td} in tlds: {td in custom_cache_extract.tlds}')
    print(f' (not cached) => is {td} in tlds: {td in no_cache_extract.tlds}')

The output:

$ python tldextract_test.py
2021-08-04 18:21:49,119:DEBUG:filelock[270]:Attempting to acquire lock 140194659320688 on /tmp/test/.tld_set/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2021-08-04 18:21:49,120:INFO:filelock[274]:Lock 140194659320688 acquired on /tmp/test/.tld_set/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2021-08-04 18:21:49,120:DEBUG:filelock[270]:Attempting to acquire lock 140194592644832 on /tmp/test/.tld_set/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2021-08-04 18:21:49,120:INFO:filelock[274]:Lock 140194592644832 acquired on /tmp/test/.tld_set/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2021-08-04 18:21:49,122:DEBUG:urllib3.connectionpool[971]:Starting new HTTPS connection (1): publicsuffix.org:443
2021-08-04 18:21:49,248:DEBUG:urllib3.connectionpool[452]:https://publicsuffix.org:443 "GET /list/public_suffix_list.dat HTTP/1.1" 200 None
2021-08-04 18:21:49,267:DEBUG:filelock[315]:Attempting to release lock 140194592644832 on /tmp/test/.tld_set/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2021-08-04 18:21:49,267:INFO:filelock[318]:Lock 140194592644832 released on /tmp/test/.tld_set/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2021-08-04 18:21:49,277:DEBUG:filelock[315]:Attempting to release lock 140194659320688 on /tmp/test/.tld_set/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2021-08-04 18:21:49,278:INFO:filelock[318]:Lock 140194659320688 released on /tmp/test/.tld_set/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2021-08-04 18:21:49,285:DEBUG:urllib3.connectionpool[971]:Starting new HTTPS connection (1): publicsuffix.org:443
2021-08-04 18:21:49,378:DEBUG:urllib3.connectionpool[452]:https://publicsuffix.org:443 "GET /list/public_suffix_list.dat HTTP/1.1" 200 None
dparts: ExtractResult(subdomain='foo', domain='co', suffix='cz')
dparts_nc: ExtractResult(subdomain='foo', domain='co', suffix='cz')
              => is co.cz in tld_set_snapshot: True
 (cached)     => is co.cz in tlds: False
 (not cached) => is co.cz in tlds: False
dparts: ExtractResult(subdomain='', domain='foo', suffix='co.uk')
dparts_nc: ExtractResult(subdomain='', domain='foo', suffix='co.uk')
              => is co.uk in tld_set_snapshot: True
 (cached)     => is co.uk in tlds: True
 (not cached) => is co.uk in tlds: True
dparts: ExtractResult(subdomain='foo', domain='co', suffix='ca')
dparts_nc: ExtractResult(subdomain='foo', domain='co', suffix='ca')
              => is co.ca in tld_set_snapshot: True
 (cached)     => is co.ca in tlds: False
 (not cached) => is co.ca in tlds: False
dparts: ExtractResult(subdomain='', domain='foo', suffix='co.za')
dparts_nc: ExtractResult(subdomain='', domain='foo', suffix='co.za')
              => is co.za in tld_set_snapshot: True
 (cached)     => is co.za in tlds: True
 (not cached) => is co.za in tlds: True

I'm not sure whether this is something in my environment/setup or a bug in tldextract, but I figured I should report it and see whether you see it as well.

Thanks,
Chuck

I had a thought after I posted this: I grepped all of the 'co.XX' lines out of the supplied snapshot file and put them all into my test data. The ones earlier in the file are okay, but past a certain point all the rest fail, so perhaps a file parsing error causes part of the data file to be skipped?

I didn't test everything in the file, of course, just the lines that matched '^co\.', but the pattern that revealed does seem suspicious to me.

Thanks,
Chuck

I should have mentioned that co.zw was the last one to work as expected, and from co.com on they all failed, which should help narrow down where something is going wrong.

Well, I was looking into this again and noticed that the BEGIN PRIVATE DOMAINS line occurs between those two entries. The code I was investigating that uses tldextract (and hence my testcase above as well) forgot to set include_psl_private_domains=True. Adding that to the testcase above causes all of the lookups to work as expected.
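For anyone landing here later, here is a minimal, stdlib-only sketch of why entries after that marker are ignored by default. It is not tldextract's actual implementation: the helper names (`load_suffixes`, `split_suffix`) and the tiny list excerpt are made up for illustration, and only the section marker text matches the real Public Suffix List. Wildcard and exception rules are left out.

```python
# Hypothetical, trimmed-down excerpt in the Public Suffix List format.
# The real list is far longer; the entries below are chosen to mirror
# the behavior reported in this issue.
PSL_EXCERPT = """\
// ===BEGIN ICANN DOMAINS===
ca
cz
uk
co.uk
zw
co.zw
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
co.com
co.ca
co.cz
// ===END PRIVATE DOMAINS===
"""

PRIVATE_MARKER = "// ===BEGIN PRIVATE DOMAINS==="


def load_suffixes(text, include_private=False):
    """Collect suffixes, dropping everything after the private marker by default."""
    if not include_private:
        text = text.split(PRIVATE_MARKER, 1)[0]
    suffixes = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "//" comment lines (section markers included).
        if line and not line.startswith("//"):
            suffixes.add(line)
    return suffixes


def split_suffix(hostname, suffixes):
    """Return (rest, suffix) for the longest matching suffix (no wildcard rules)."""
    labels = hostname.lower().split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return ".".join(labels[:i]), candidate
    return hostname, ""


icann_only = load_suffixes(PSL_EXCERPT)
with_private = load_suffixes(PSL_EXCERPT, include_private=True)

print(split_suffix("foo.co.ca", icann_only))    # private entry co.ca is not loaded
print(split_suffix("foo.co.ca", with_private))  # co.ca is now a known suffix
```

In tldextract itself, the equivalent switch is the include_psl_private_domains=True argument mentioned above.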

Sorry, closing this issue.

No worries. It's one of the most divisive things about the PSL. There should be a clearer FAQ entry on this repo. Thanks for the clear debugging, expected, and actual output!