v5.0.0 tldextract.extract not working like v4.0.0 on pandas dataframe
import pandas as pd
import tldextract

data = ["https://url1.com","http://url2.com","url3.com"]
df = pd.DataFrame(data, columns=['urls'])
df['extracted'] = df.urls.apply(lambda x: '.'.join(tldextract.extract(x)[1:3])) # add column with the registered domain (domain + suffix)
df
In v4.0.0 this produces:
| index | urls | extracted |
|---|---|---|
| 0 | https://url1.com | url1.com |
| 1 | http://url2.com | url2.com |
| 2 | url3.com | url3.com |
In v5.0.0 this results in the following error:
TypeError Traceback (most recent call last)
in <cell line: 6>()
4 data = ["https://url1.com/","http://url2.com/","url3.com"]
5 df = pd.DataFrame(data, columns=['urls'])
----> 6 df['extracted'] = df.urls.apply(lambda x: '.'.join(tldextract.extract(x)[1:3])) # add column with top level domain url
7 df
4 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwargs)
4769 dtype: float64
4770 """
-> 4771 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
4772
4773 def _reduce(
/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py in apply(self)
1121
1122 # self.f is Callable
-> 1123 return self.apply_standard()
1124
1125 def agg(self):
/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py in apply_standard(self)
1172 else:
1173 values = obj.astype(object)._values
-> 1174 mapped = lib.map_infer(
1175 values,
1176 f,
/usr/local/lib/python3.10/dist-packages/pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
in <lambda>(x)
4 data = ["https://url1.com/","http://url2.com/","url3.com"]
5 df = pd.DataFrame(data, columns=['urls'])
----> 6 df['extracted'] = df.urls.apply(lambda x: '.'.join(tldextract.extract(x)[1:3])) # add column with top level domain url
7 df
TypeError: 'ExtractResult' object is not subscriptable
Yes, that is expected in 5.0.0. See the breaking changes in the changelog.
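For context, the error is what you would see if the result object stopped being a tuple: slicing like `[1:3]` only works on tuple-like objects, while a plain dataclass exposes its fields by name only. Here is a minimal sketch of that difference using only the standard library. `TupleResult` and `DataclassResult` are hypothetical stand-ins made up for illustration, not tldextract's actual classes.

```python
from collections import namedtuple
from dataclasses import dataclass

# Hypothetical stand-ins for illustration only; these are NOT tldextract's classes.
TupleResult = namedtuple("TupleResult", ["subdomain", "domain", "suffix"])


@dataclass(frozen=True)
class DataclassResult:
    subdomain: str
    domain: str
    suffix: str


old_style = TupleResult("", "url1", "com")
new_style = DataclassResult("", "url1", "com")

# A namedtuple is a tuple, so slicing works.
sliced = ".".join(old_style[1:3])

# A plain dataclass defines no __getitem__, so slicing raises TypeError.
try:
    new_style[1:3]
    slice_error = None
except TypeError as exc:
    slice_error = str(exc)

# Field access by name works on both kinds of object.
by_name = ".".join((new_style.domain, new_style.suffix))

print(sliced)
print(slice_error)
print(by_name)
```

Accessing fields by name is the form that works on both kinds of object, so it is the safer choice if the code has to run against more than one version.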
In your case, you might want something like the following.
def my_extract(url: str) -> str:
    ext = tldextract.extract(url)
    return '.'.join((ext.domain, ext.suffix))

df['extracted'] = df.urls.apply(my_extract)
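One edge case to be aware of: joining `domain` and `suffix` with `'.'` leaves a trailing dot when the suffix is empty (for example, a bare hostname such as `localhost`). Below is a small sketch of a variant that skips empty parts. The helper name `join_domain_suffix` and the `extract` parameter are assumptions made for this example; the extractor is passed in so the helper can be tried without the library installed.

```python
from typing import Any, Callable


def join_domain_suffix(url: str, extract: Callable[[str], Any]) -> str:
    """Join the domain and suffix of `url`, skipping empty parts.

    `extract` is any callable that returns an object with `domain` and
    `suffix` string attributes (hypothetical parameter for this sketch).
    """
    ext = extract(url)
    # Filter out empty strings so a missing suffix does not leave a trailing dot.
    return ".".join(part for part in (ext.domain, ext.suffix) if part)


# Intended usage with the DataFrame from this thread (assumes tldextract and
# pandas are installed and `df` is defined as above):
# df['extracted'] = df.urls.apply(lambda u: join_domain_suffix(u, tldextract.extract))
```

Whether an empty suffix should yield the bare domain or an empty string depends on what the column is used for, so treat the filtering above as one possible choice.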