Crash while parsing Wikipedia page
dan100110 opened this issue · 1 comment
dan100110 commented
Using version 2.0.4, I'm parsing the following webpage: https://en.wikipedia.org/wiki/Shyster_(expert_system)
The page is retrieved and parsed with the following calls:
```python
import requests
from extractnet import Extractor

url = "https://en.wikipedia.org/wiki/Shyster_(expert_system)"
raw_html = requests.get(url).text
results = Extractor().extract(raw_html)
```

Running this code results in the following error:
```
tests/utils.py:47: in get_content_from_url
    results = Extractor().extract(raw_html)
.venv/lib/python3.8/site-packages/extractnet/pipeline.py:60: in extract
    documents_meta_data = self.extract_one_meta(html)
.venv/lib/python3.8/site-packages/extractnet/pipeline.py:48: in extract_one_meta
    meta_data = extract_metadata(document)
.venv/lib/python3.8/site-packages/extractnet/metadata_extraction/metadata.py:408: in extract_metadata
    metadata['url'] = extract_url(tree, default_url)
.venv/lib/python3.8/site-packages/extractnet/metadata_extraction/metadata.py:334: in extract_url
    url = url_normalizer(parsed_url)
.venv/lib/python3.8/site-packages/extractnet/metadata_extraction/url_utils.py:37: in url_normalizer
    parsed_url = urlparse(url)
/usr/lib/python3.8/urllib/parse.py:375: in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
/usr/lib/python3.8/urllib/parse.py:127: in _coerce_args
    return _decode_args(args) + (_encode_result,)
/usr/lib/python3.8/urllib/parse.py:111: in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

.0 = <tuple_iterator object at 0x7fd060558130>

>   return tuple(x.decode(encoding, errors) if x else '' for x in args)
E   AttributeError: 'ParseResult' object has no attribute 'decode'

/usr/lib/python3.8/urllib/parse.py:111: AttributeError
```
I believe the problem is that the metadata for this page does not contain the URL, but that is just a guess.
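For what it's worth, the traceback points at a double parse rather than missing metadata: `extract_url` passes an already-parsed `ParseResult` into `url_normalizer`, which calls `urlparse()` on it a second time. A minimal sketch that reproduces the same error outside of extractnet (my reading of the traceback, not of the library internals):

```python
from urllib.parse import urlparse

# urlparse() treats any non-str argument as bytes-like and tries to
# .decode() it, so re-parsing a ParseResult raises the same error as above.
parsed = urlparse("https://en.wikipedia.org/wiki/Shyster_(expert_system)")
urlparse(parsed)  # AttributeError: 'ParseResult' object has no attribute 'decode'
```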
theblackcat102 commented
@dan100110 This error was fixed in the latest commit on master and will be included in the next release.
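For reference, the crash comes from re-parsing a URL that is already a `ParseResult`; a guard along the following lines avoids it (a hypothetical sketch of the pattern, not the actual commit):

```python
from urllib.parse import ParseResult, urlparse

def normalize_url(url):
    # Hypothetical helper, not the actual extractnet code: only call
    # urlparse() on raw strings, and accept an already-parsed ParseResult
    # unchanged so urllib's bytes-coercion path is never hit.
    parsed = url if isinstance(url, ParseResult) else urlparse(url)
    return parsed.geturl()
```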
If you don't need the metadata right now, you can pass `metadata_mining=False` to the `extract` function, which skips the failing code path:
```python
import requests
from extractnet import Extractor

raw_html = requests.get(url).text
results = Extractor().extract(raw_html, metadata_mining=False)
```