Unstructured-IO/unstructured

bug/AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'

Closed · 1 comment

Describe the bug
I get AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing' while parsing a web page.

To Reproduce

Running the following code results in an AttributeError:

from unstructured.partition.html import partition_html
import base64

encoded_url = "aHR0cHM6Ly9hdmFuZWVyaGVhbHRoLmNvbS9ibG9nL2dhaW5pbmctY292ZXJhZ2UtaW5zaWdodHMtYXMtYS1wYXRoLXRvLXBheW1lbnQtaW50ZWdyaXR5Lw=="
decoded_url = base64.b64decode(encoded_url).decode("utf-8")

headers = {
    # Provide a User-Agent to avoid getting blocked as a scraper
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
}

partition_html(
    url=decoded_url,
    headers=headers,
)

I obscured the URL with base64 only because the site belongs to my employer and I didn't want to raise any concerns about self-promotion. You can base64-decode it before trying it.

Here is the error:

Traceback (most recent call last):
  File "/Path/to/my/project/src/unstructured-issue.py", line 12, in <module>
    partition_html(
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/documents/elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/file_utils/filetype.py", line 731, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/file_utils/filetype.py", line 687, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/html/partition.py", line 103, in partition_html
    elements = list(
               ^^^^^
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/lang.py", line 475, in apply_lang_metadata
    elements = list(elements)
               ^^^^^^^^^^^^^^
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/html/partition.py", line 222, in iter_elements
    yield from cls(opts)._iter_elements()
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/html/partition.py", line 229, in _iter_elements
    for e in self._main.iter_elements():
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/html/parser.py", line 361, in iter_elements
    yield from self._element_from_text_or_tail(block_item.tail or "", q)
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/html/parser.py", line 377, in _element_from_text_or_tail
    for node in self._iter_text_segments(text, q):
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/html/parser.py", line 421, in _iter_text_segments
    while q and q[0].is_phrasing:
                ^^^^^^^^^^^^^^^^
AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'

Expected behavior
I'd like Unstructured to scrape the text from this web page. It works great everywhere else I've tried so far; only this site seems to be a problem.

Screenshots
N/A

Environment Info

OS version:  macOS-14.6.1-arm64-arm-64bit
Python version:  3.12.4
unstructured version:  0.15.12
unstructured-inference is not installed
pytesseract is not installed
Torch is not installed
Detectron2 is not installed

Additional context
I found this other issue which has the exact same AttributeError, but it seems to be for a different concern.

Thank you very much for your time! I've been using unstructured for about a year now. Happy for any workarounds or to try anything out to help resolve this issue.

My bad: this is the exact same issue described in the third bullet point of #3578. A <?xml version='1.0' encoding='UTF-8'?> line in the page breaks the parser. My workaround for now is to remove that text after fetching the page and before passing it to Unstructured. Looking forward to an eventual fix, though! Cheers.
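For anyone hitting the same error, here is a rough sketch of that workaround. It is only an illustration under a few assumptions: `strip_xml_declarations` and `partition_page` are hypothetical helper names (not part of unstructured), the page is fetched with `requests`, and the cleaned markup is handed to `partition_html` through its `text` argument instead of `url`. The regex removes every XML declaration it finds, not just one at the top of the page.

```python
import re

# Matches an XML declaration such as <?xml version='1.0' encoding='UTF-8'?>.
_XML_DECLARATION = re.compile(r"<\?xml\b[^>]*\?>", re.IGNORECASE)


def strip_xml_declarations(html: str) -> str:
    """Return html with every XML declaration removed (hypothetical helper)."""
    return _XML_DECLARATION.sub("", html)


def partition_page(url: str, headers: dict[str, str]):
    """Fetch url, strip XML declarations, then partition the cleaned HTML.

    Hypothetical wrapper: assumes requests and unstructured are installed.
    """
    # Imported lazily so strip_xml_declarations works without these packages.
    import requests
    from unstructured.partition.html import partition_html

    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return partition_html(text=strip_xml_declarations(response.text))
```

With the reproduction script above, this would be called as `partition_page(decoded_url, headers)` in place of the direct `partition_html(url=...)` call. I have not checked whether other processing instructions can trigger the same error, so the pattern only targets `<?xml ...?>`.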