currentslab/extractnet

Crash while parsing Wikipedia page

dan100110 opened this issue · 1 comment

Using version 2.0.4, I get a crash when parsing the following webpage: https://en.wikipedia.org/wiki/Shyster_(expert_system)

The webpage is retrieved with the following call:

```python
import requests
from extractnet import Extractor

url = 'https://en.wikipedia.org/wiki/Shyster_(expert_system)'
raw_html = requests.get(url).text
results = Extractor().extract(raw_html)
```

Running this code results in the following error:

```
tests/utils.py:47: in get_content_from_url
    results = Extractor().extract(raw_html)
.venv/lib/python3.8/site-packages/extractnet/pipeline.py:60: in extract
    documents_meta_data = self.extract_one_meta(html)
.venv/lib/python3.8/site-packages/extractnet/pipeline.py:48: in extract_one_meta
    meta_data = extract_metadata(document)
.venv/lib/python3.8/site-packages/extractnet/metadata_extraction/metadata.py:408: in extract_metadata
    metadata['url'] = extract_url(tree, default_url)
.venv/lib/python3.8/site-packages/extractnet/metadata_extraction/metadata.py:334: in extract_url
    url = url_normalizer(parsed_url)
.venv/lib/python3.8/site-packages/extractnet/metadata_extraction/url_utils.py:37: in url_normalizer
    parsed_url = urlparse(url)
/usr/lib/python3.8/urllib/parse.py:375: in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
/usr/lib/python3.8/urllib/parse.py:127: in _coerce_args
    return _decode_args(args) + (_encode_result,)
/usr/lib/python3.8/urllib/parse.py:111: in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

.0 = <tuple_iterator object at 0x7fd060558130>

>   return tuple(x.decode(encoding, errors) if x else '' for x in args)
E   AttributeError: 'ParseResult' object has no attribute 'decode'

/usr/lib/python3.8/urllib/parse.py:111: AttributeError
```

I believe the problem is that the metadata for this page does not contain the URL, but that is just a guess. From the traceback, it looks like extract_url produces an already-parsed ParseResult (parsed_url) that url_normalizer then passes back into urlparse, which only accepts str or bytes.
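The urlparse failure itself is easy to reproduce in isolation; the sketch below uses only the standard library and mirrors what the traceback shows happening inside url_normalizer:

```python
from urllib.parse import urlparse

# The first parse succeeds and returns a ParseResult (a tuple subclass).
parsed = urlparse('https://en.wikipedia.org/wiki/Shyster_(expert_system)')

# Parsing the result again fails: urllib tries to coerce the non-str
# argument by calling .decode() on it, which ParseResult does not have.
urlparse(parsed)  # AttributeError: 'ParseResult' object has no attribute 'decode'
```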

@dan100110 This error was fixed in the latest commit on master, and the fix will be included in a future release.
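In the meantime, if you need the fix right away, installing directly from master should pick it up (assuming a standard pip/git setup): `pip install git+https://github.com/currentslab/extractnet.git`.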

If you don't need the metadata right now, you can pass `metadata_mining=False` to the `extract` function, which prevents this problem:

```python
raw_html = requests.get(url).text
results = Extractor().extract(raw_html, metadata_mining=False)
```
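Note that this skips the metadata extraction step entirely (the extract_one_meta call in the traceback), so metadata fields such as the page URL won't be populated; the main content extraction still runs as usual.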