john-kurkowski/tldextract

v5.0.0 tldextract.extract not working like v4.0.0 on pandas dataframe as


import pandas as pd
import tldextract

data = ["https://url1.com","http://url2.com","url3.com"]
df = pd.DataFrame(data, columns=['urls'])
df['extracted'] = df.urls.apply(lambda x: '.'.join(tldextract.extract(x)[1:3])) # add column with top level domain url
df

In v4.0.0 this produces:

index  urls              extracted
0      https://url1.com  url1.com
1      http://url2.com   url2.com
2      url3.com          url3.com

In v5.0.0 the same code raises the following error:


TypeError Traceback (most recent call last)
in <cell line: 6>()
4 data = ["https://url1.com/","http://url2.com/","url3.com"]
5 df = pd.DataFrame(data, columns=['urls'])
----> 6 df['extracted'] = df.urls.apply(lambda x: '.'.join(tldextract.extract(x)[1:3])) # add column with top level domain url
7 df

4 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwargs)
4769 dtype: float64
4770 """
-> 4771 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
4772
4773 def _reduce(

/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py in apply(self)
1121
1122 # self.f is Callable
-> 1123 return self.apply_standard()
1124
1125 def agg(self):

/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py in apply_standard(self)
1172 else:
1173 values = obj.astype(object)._values
-> 1174 mapped = lib.map_infer(
1175 values,
1176 f,

/usr/local/lib/python3.10/dist-packages/pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

in (x)
4 data = ["https://url1.com/","http://url2.com/","url3.com"]
5 df = pd.DataFrame(data, columns=['urls'])
----> 6 df['extracted'] = df.urls.apply(lambda x: '.'.join(tldextract.extract(x)[1:3])) # add column with top level domain url
7 df

TypeError: 'ExtractResult' object is not subscriptable


Yes, that is expected in 5.0.0. See the breaking changes in the changelog.

In your case, you might want something like the following.

def my_extract(url: str) -> str:
    # In 5.0.0 the result cannot be sliced, so read its fields by name.
    ext = tldextract.extract(url)
    return '.'.join((ext.domain, ext.suffix))

df['extracted'] = df.urls.apply(my_extract)
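
To see why the slice stopped working while attribute access keeps working, here is a minimal sketch using only the standard library. It assumes the breaking change in 5.0.0 is that the result went from a tuple-like object to a plain dataclass, which is consistent with the "not subscriptable" error above. `TupleResult`, `DataclassResult`, and `join_by_name` are hypothetical stand-ins for illustration, not tldextract's real classes.

```python
from collections import namedtuple
from dataclasses import dataclass

# Hypothetical stand-in for a tuple-based result (assumed 4.x-style behavior).
TupleResult = namedtuple("TupleResult", ["subdomain", "domain", "suffix"])


# Hypothetical stand-in for a dataclass-based result (assumed 5.x-style behavior).
@dataclass(frozen=True)
class DataclassResult:
    subdomain: str
    domain: str
    suffix: str


def join_by_name(ext) -> str:
    """Join domain and suffix using attribute access only."""
    return ".".join((ext.domain, ext.suffix))


old_style = TupleResult("", "url1", "com")
new_style = DataclassResult("", "url1", "com")

# Slicing works on the tuple-based stand-in.
sliced = ".".join(old_style[1:3])

# A plain dataclass defines no __getitem__, so the same slice raises TypeError.
try:
    new_style[1:3]
    slice_error = None
except TypeError as exc:
    slice_error = str(exc)

# Attribute access works on both stand-ins.
by_name_old = join_by_name(old_style)
by_name_new = join_by_name(new_style)

print(sliced, by_name_old, by_name_new)
print(slice_error)
```

Because the helper reads `domain` and `suffix` by name, the same function should behave the same way on either kind of result, which is why `my_extract` above is a safe replacement for the `[1:3]` slice.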