desgeeko/pdfsyntax

ValueError: could not convert string to float: b'5.0.0'

simeoncarstens opened this issue · 1 comment

The title says it all 🙂 When running

python3 -m pdfsyntax overview

on an ebook I downloaded somewhere, I get the error in the title with the following traceback:

Traceback (most recent call last):
  File "/nix/store/x7agqy4zr8na6rc7252avhwppgfylz33-python3-3.10.13/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/nix/store/x7agqy4zr8na6rc7252avhwppgfylz33-python3-3.10.13/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/__main__.py", line 4, in <module>
    main()
  File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/cli.py", line 26, in main
    overview(args.filename)
  File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/cli.py", line 229, in overview
    m = metadata(doc)
  File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/api.py", line 124, in metadata
    i = info(doc) or {}
  File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/docstruct.py", line 310, in info
    return get_object(doc, info)
  File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/docstruct.py", line 137, in get_object
    res = memoize_obj_in_cache(doc.index, doc.data[-1]['fdata'], ref, doc.cache)
  File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/docstruct.py", line 108, in memoize_obj_in_cache
    obj = parse_obj(text, i)
  File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/objects.py", line 228, in parse_obj
    obj = dedicated_type(text[h:i], t)
  File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/objects.py", line 193, in dedicated_type
    return float(text)
ValueError: could not convert string to float: b'5.0.0'

So it erroneously tries to convert 5.0.0 to a float, which of course doesn't work. I can "fix" it with something like

return float(".".join(text.decode("ascii").split(".")[:2]))

which makes the command run successfully. But that likely won't cover all cases where this can go wrong: I have no idea what kind of strings it might try to convert in other edge cases.

Thank you for your bug report!
The issue is that a literal string is cut short because it contains unescaped parentheses; the parser then tries to interpret the remaining characters as a numeric object because that slice does not start with a parenthesis.
According to the PDF specification, the tokenizer should handle balanced parentheses even if they are not escaped.
I will commit a fix in a few minutes.
Regards
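
For anyone landing here later, the diagnosis above can be illustrated with a minimal sketch. This is not the pdfsyntax code and not the committed fix; the function names and the sample Info entry are hypothetical, chosen only to show how stopping at the first ")" leaves a stray "5.0.0" behind, while counting nesting depth keeps the whole literal string together.

```python
def naive_literal_end(data: bytes, start: int) -> int:
    """Index just past the first ')' after start (ignores nesting)."""
    return data.index(b")", start) + 1


def balanced_literal_end(data: bytes, start: int) -> int:
    """Index just past the ')' that closes the literal string opening at start."""
    if data[start:start + 1] != b"(":
        raise ValueError("literal string must start with '('")
    depth = 0
    i = start
    while i < len(data):
        c = data[i:i + 1]
        if c == b"\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if c == b"(":
            depth += 1
        elif c == b")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unterminated literal string")


# Hypothetical Info dictionary entry with unescaped, balanced parentheses.
sample = rb"/Producer (Some Tool (build 7) 5.0.0) /Title (x)"
start = sample.index(b"(")

naive_end = naive_literal_end(sample, start)
balanced_end = balanced_literal_end(sample, start)

print(sample[start:naive_end])     # b'(Some Tool (build 7)'
print(sample[start:balanced_end])  # b'(Some Tool (build 7) 5.0.0)'
```

With the naive scan, the text after the cut begins with "5.0.0", which does not start with a parenthesis and so looks like a number to a parser, hence the float() error in the traceback. The depth counter treats escaped parentheses as ordinary characters and only ends the string when the outermost "(" is closed.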