nlmatics/llmsherpa

Getting a JSON parse exception when using the `contents=` parameter of LayoutPDFReader.read_pdf() with None as the path.


Steps taken:

from llmsherpa.readers import LayoutPDFReader
from pathlib import Path

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
parser = LayoutPDFReader(llmsherpa_api_url)
path = Path('yp') / 'tests' / 'content' / 'Ambrx EX-2.1.pdf'
with open(path, 'rb') as f:
    content = f.read()
parser.read_pdf(None, content)

Resulting stack trace:

Traceback (most recent call last):
  File "/home/mboyd/.pycharm_helpers/pydev/pydevconsole.py", line 364, in runcode
    coro = func()
           ^^^^^^
  File "<input>", line 1, in <module>
  File "/home/mboyd/.virtualenvs/yp-demo/lib/python3.12/site-packages/llmsherpa/readers/file_reader.py", line 72, in read_pdf
    response_json = json.loads(parser_response.data.decode("utf-8"))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mboyd/.pyenv/versions/3.12.1/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mboyd/.pyenv/versions/3.12.1/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mboyd/.pyenv/versions/3.12.1/lib/python3.12/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
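
The last line of the trace is the message json.loads gives when the very first character cannot start a JSON value, which includes an empty body. The issue does not show what the server actually returned, so the following is only a sketch of how a guarded decode could surface the body instead of a bare decoder error. The helper name decode_json_body is hypothetical and not part of llmsherpa.

```python
import json
from typing import Any


def decode_json_body(body: bytes) -> Any:
    """Decode a response body as JSON, raising a descriptive error otherwise.

    Hypothetical helper for illustration; not part of llmsherpa.
    """
    text = body.decode("utf-8", errors="replace")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Include the start of the body so an empty or HTML reply is obvious.
        raise ValueError(f"parser returned non-JSON body: {text[:80]!r}") from exc


# An empty body reproduces the message seen at the bottom of the traceback.
try:
    json.loads("")
except json.JSONDecodeError as exc:
    message = str(exc)  # "Expecting value: line 1 column 1 (char 0)"
```

With a guard like this, the error would quote whatever the service sent back, which makes it easier to tell an empty reply from an error page.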

Calling parser.read_pdf('', contents=content) instead does parse successfully: an empty string is also falsy but, unlike None, converts cleanly to valid JSON in _parse_pdf(). None would still be the usual Pythonic way to specify no value.
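
Until None is accepted, the empty-string behaviour described above can be wrapped so callers never pass None themselves. This is a minimal sketch, not a fix to the library: read_pdf_from_bytes is a hypothetical name, and the only assumption about the reader is that it exposes read_pdf(path, contents=...) as used in this issue.

```python
from typing import Any, Optional


def read_pdf_from_bytes(reader: Any, contents: bytes, name: Optional[str] = None) -> Any:
    """Call reader.read_pdf with in-memory contents, never passing None as the name.

    Hypothetical wrapper; `reader` is assumed to expose read_pdf(path, contents=...).
    """
    # Substitute the empty string, which the issue reports as working.
    safe_name = "" if name is None else name
    return reader.read_pdf(safe_name, contents=contents)
```

With the parser and content from the steps above, this would be called as read_pdf_from_bytes(parser, content).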

Ran into this as well. Wrote my own class because of it.
My workaround:

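import json
from typing import Any


class Parser:
    ...

    def _post_request(self, contents: bytes, mime_type: str) -> dict[str, Any]:
        # Upload the raw bytes with an empty filename rather than None.
        parser_response = self.session.post(
            self.uri,
            files={"file": ("", contents, mime_type)},
        )
        # Fail on HTTP errors before attempting to decode the body as JSON.
        parser_response.raise_for_status()
        response = json.loads(parser_response.text)
        return response.get("return_dict").get("result")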

A bonus is that this also deals with path_or_url not being nullable in parser.read_pdf.