bug/AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'
Closed this issue · 1 comments
Describe the bug
I get AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing' while parsing a web page.
To Reproduce
Running the following code results in an AttributeError:
from unstructured.partition.html import partition_html
import base64

encoded_url = "aHR0cHM6Ly9hdmFuZWVyaGVhbHRoLmNvbS9ibG9nL2dhaW5pbmctY292ZXJhZ2UtaW5zaWdodHMtYXMtYS1wYXRoLXRvLXBheW1lbnQtaW50ZWdyaXR5Lw=="
decoded_url = base64.b64decode(encoded_url).decode("utf-8")

headers = {
    # Provide a User-Agent to avoid getting blocked as a scraper
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
}

partition_html(
    url=decoded_url,
    headers=headers,
)
I base64-encoded the URL only because the site belongs to my employer and I didn't want to raise any concerns about self-promotion. The code above decodes it before fetching the page.
Here is the error:
Traceback (most recent call last):
  File "/Path/to/my/project/src/unstructured-issue.py", line 12, in <module>
    partition_html(
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/documents/elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/file_utils/filetype.py", line 731, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/file_utils/filetype.py", line 687, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/html/partition.py", line 103, in partition_html
    elements = list(
               ^^^^^
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/lang.py", line 475, in apply_lang_metadata
    elements = list(elements)
               ^^^^^^^^^^^^^^
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/html/partition.py", line 222, in iter_elements
    yield from cls(opts)._iter_elements()
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/html/partition.py", line 229, in _iter_elements
    for e in self._main.iter_elements():
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/html/parser.py", line 361, in iter_elements
    yield from self._element_from_text_or_tail(block_item.tail or "", q)
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/html/parser.py", line 377, in _element_from_text_or_tail
    for node in self._iter_text_segments(text, q):
  File "/Path/to/my/project/src/.venv/lib/python3.12/site-packages/unstructured/partition/html/parser.py", line 421, in _iter_text_segments
    while q and q[0].is_phrasing:
                ^^^^^^^^^^^^^^^^
AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'
Expected behavior
I'd like Unstructured to scrape the text from this web page. It works great everywhere else I've tried so far; only this site seems to be a problem.
Screenshots
N/A
Environment Info
OS version: macOS-14.6.1-arm64-arm-64bit
Python version: 3.12.4
unstructured version: 0.15.12
unstructured-inference is not installed
pytesseract is not installed
Torch is not installed
Detectron2 is not installed
Additional context
I found this other issue which has the exact same AttributeError, but it seems to be for a different concern.
Thank you very much for your time! I've been using unstructured for about a year now. Happy for any workarounds or to try anything out to help resolve this issue.
My bad, it's the exact same issue described in the third bullet point of #3578: a <?xml version='1.0' encoding='UTF-8'?> line in the page breaks the parser. My workaround for now is to remove that text after fetching the page and before passing the HTML to Unstructured. Looking forward to an eventual fix though! Cheers.