HazyResearch/pdftotree

TypeError: unsupported operand type(s) for +: 'PDFObjRef' and 'bytes'

adarsa opened this issue · 1 comment

When calling pdftotree.parse(pdf_file), I get the following error:

>>> output = pdftotree.parse('403000541.pdf')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/adarsa/ilimi/pyenv36/lib/python3.6/site-packages/pdftotree/core.py", line 63, in parse
    if not extractor.is_scanned():
  File "/Users/adarsa/ilimi/pyenv36/lib/python3.6/site-packages/pdftotree/TreeExtract.py", line 121, in is_scanned
    self.parse()
  File "/Users/adarsa/ilimi/pyenv36/lib/python3.6/site-packages/pdftotree/TreeExtract.py", line 91, in parse
    for page_num, layout in enumerate(analyze_pages(self.pdf_file)):
  File "/Users/adarsa/ilimi/pyenv36/lib/python3.6/site-packages/pdftotree/utils/pdf/pdf_utils.py", line 136, in analyze_pages
    interpreter.process_page(page)
  File "/Users/adarsa/ilimi/pyenv36/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 841, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/Users/adarsa/ilimi/pyenv36/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 852, in render_contents
    self.init_resources(resources)
  File "/Users/adarsa/ilimi/pyenv36/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/Users/adarsa/ilimi/pyenv36/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 204, in get_font
    font = self.get_font(None, subspec)
  File "/Users/adarsa/ilimi/pyenv36/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 195, in get_font
    font = PDFCIDFont(self, spec)
  File "/Users/adarsa/ilimi/pyenv36/lib/python3.6/site-packages/pdfminer/pdffont.py", line 641, in __init__
    self.cidcoding = (self.cidsysteminfo.get('Registry', 'unknown') + b'-' +
TypeError: unsupported operand type(s) for +: 'PDFObjRef' and 'bytes'

I installed the package by building it under Python 3.6.

I looked for the following line of code in both pdfminer and pdfminer.six, and found that it appears only in pdfminer.

  File "/Users/adarsa/ilimi/pyenv36/lib/python3.6/site-packages/pdfminer/pdffont.py", line 641, in __init__
    self.cidcoding = (self.cidsysteminfo.get('Registry', 'unknown') + b'-' +
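To illustrate why that line fails, here is a minimal sketch that does not use pdfminer at all. It assumes the 'Registry' entry was stored as an indirect reference that was never resolved before the concatenation, which is what the 'PDFObjRef' in the error message suggests. FakeObjRef and resolve_value are hypothetical stand-ins written for this example, not pdfminer names.

```python
class FakeObjRef:
    """Hypothetical stand-in for an unresolved indirect PDF object reference."""

    def __init__(self, target):
        self.target = target

    def resolve(self):
        # Return the object this reference points at.
        return self.target


def resolve_value(value):
    """Follow FakeObjRef links until a plain value is reached (illustrative helper)."""
    while isinstance(value, FakeObjRef):
        value = value.resolve()
    return value


# Assumed shape of the font's system-info dictionary: the value is a reference, not bytes.
cidsysteminfo = {"Registry": FakeObjRef(b"Adobe")}

# Concatenating the unresolved reference with bytes raises the same kind of TypeError.
try:
    cidsysteminfo.get("Registry", b"unknown") + b"-"
    message = None
except TypeError as exc:
    message = str(exc)

# Resolving the reference first makes the concatenation well-defined.
fixed = resolve_value(cidsysteminfo.get("Registry", b"unknown")) + b"-"
```

The point of the sketch is only that the left operand must be resolved to bytes before `+` is applied; how each pdfminer fork actually handles this should be checked in its own source.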

This issue was also reported against the original pdfminer at euske/pdfminer#258, and was supposedly fixed by euske/pdfminer@cc7d409.

Assuming the reporter was using pdfminer (instead of pdfminer.six), this issue is invalid: pdftotree has listed pdfminer.six as a dependency since 8c190cb, well before this issue was reported, and still uses it.
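Both distributions install an import package named pdfminer, so the site-packages path in the traceback does not show which one is present. A rough way to check is sketched below. It assumes Python 3.8 or later for importlib.metadata (the reporter's Python 3.6 would need a different approach), and installed_version and pdfminer_report are hypothetical helper names for this example.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def pdfminer_report():
    """Map each candidate distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in ("pdfminer", "pdfminer.six")}


if __name__ == "__main__":
    for name, version in pdfminer_report().items():
        print(name, "->", version or "not installed")
```

If the report shows pdfminer installed and pdfminer.six missing, or both installed side by side, the environment does not match the dependency pdftotree declares.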