ValueError: could not convert string to float: b'5.0.0'
simeoncarstens opened this issue · 1 comment
The title says it all. When running
python3 -m pdfsyntax overview
on an ebook I downloaded somewhere, I get the error in the title with the following traceback:
Traceback (most recent call last):
File "/nix/store/x7agqy4zr8na6rc7252avhwppgfylz33-python3-3.10.13/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/nix/store/x7agqy4zr8na6rc7252avhwppgfylz33-python3-3.10.13/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/__main__.py", line 4, in <module>
main()
File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/cli.py", line 26, in main
overview(args.filename)
File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/cli.py", line 229, in overview
m = metadata(doc)
File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/api.py", line 124, in metadata
i = info(doc) or {}
File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/docstruct.py", line 310, in info
return get_object(doc, info)
File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/docstruct.py", line 137, in get_object
res = memoize_obj_in_cache(doc.index, doc.data[-1]['fdata'], ref, doc.cache)
File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/docstruct.py", line 108, in memoize_obj_in_cache
obj = parse_obj(text, i)
File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/objects.py", line 228, in parse_obj
obj = dedicated_type(text[h:i], t)
File "/home/simeon/test/penv/lib/python3.10/site-packages/pdfsyntax/objects.py", line 193, in dedicated_type
return float(text)
ValueError: could not convert string to float: b'5.0.0'
So it erroneously tries to convert b'5.0.0' to a float, which of course doesn't work. I can "fix" it with something like
return float(".".join(text.decode("ascii").split(".")[:2]))
This makes the command run successfully, but it likely won't cover all cases where this can go wrong: I have no idea what kind of strings it might try to convert in other edge cases.
Thank you for your bug report!
The issue is that a literal string is cut short because it contains unescaped parentheses; the parser then tries to interpret the remaining characters as a numeric object, because that slice does not start with a parenthesis.
According to the PDF specification, the tokenizer should handle balanced parentheses even if they are not escaped.
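To illustrate the idea, here is a minimal sketch of a scanner that tracks parenthesis depth while reading a literal string. It is not the pdfsyntax implementation: the function name read_literal_string and the sample bytes are hypothetical, and escape handling is simplified to skipping the byte after a backslash.

```python
from typing import Tuple


def read_literal_string(data: bytes, start: int) -> Tuple[bytes, int]:
    """Return (raw_contents, end) for the literal string opening at data[start].

    `end` is the index just past the closing parenthesis. Escape sequences
    are left undecoded in the returned contents.
    """
    if data[start:start + 1] != b"(":
        raise ValueError("a literal string must start with '('")
    depth = 0
    i = start
    while i < len(data):
        c = data[i:i + 1]
        if c == b"\\":
            # Simplification: skip the backslash and the byte it escapes,
            # so an escaped parenthesis never changes the depth.
            i += 2
            continue
        if c == b"(":
            depth += 1
        elif c == b")":
            depth -= 1
            if depth == 0:
                # Only the parenthesis that balances the opening one ends the string.
                return data[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated literal string")


# Hypothetical input with unescaped but balanced inner parentheses.
sample = b"/Creator (Writer (v5.0.0)) /Producer (x)"
contents, end = read_literal_string(sample, sample.index(b"("))
print(contents)      # b'Writer (v5.0.0)'
print(sample[end:])  # b' /Producer (x)'
```

A scanner that stops at the first closing parenthesis would end the string too early here and leave the rest of the bytes to be parsed as some other object type, which is the kind of failure described above.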
I will commit a fix in a few minutes.
Regards