jcushman/pdfquery

Error with annotations

Opened this issue · 3 comments

Found an issue when upgrading from pdfquery 0.2.7 to 0.4.3. It looks like support for annotations was added in 0.3.0. Here is what appears to be happening: in the _add_annots() method in pdfquery.py, pdfminer finds an annotation object. _add_annots() retrieves this object and converts all of its information into strings (via obj_to_string()). When the method is called again, pdfminer returns a cached version of the annotation object, only this time all of the information has already been converted into strings by pdfquery. This leads to an error on line 649:

annot['URI'] = resolve1(annot['A'])['URI']

The first time through _add_annots(), resolve1(annot['A']) returns a dict with 'URI' as one of its keys. The second time through, annot['A'] is a string representation of that dict (converted by obj_to_string()), so indexing it with 'URI' fails.
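For anyone following along, here is a minimal sketch of the mechanism as I understand it, without pdfquery or pdfminer installed. Everything in it is a stand-in: a plain dict plays the cached annotation object, str() plays obj_to_string(), and add_annots_in_place() is a hypothetical simplification of _add_annots(), not the library's actual code.

```python
# Hypothetical stand-ins: a plain dict plays the cached pdfminer annotation,
# and str() plays pdfquery's obj_to_string().
cached_annot = {
    "Subtype": "/Link",
    "A": {"S": "/URI", "URI": "http://example.com"},
}


def add_annots_in_place(annot):
    """Mimic the reported flow: read annot['A']['URI'], then stringify in place."""
    try:
        annot["URI"] = annot["A"]["URI"]
    except KeyError:
        pass
    for key in list(annot):
        # Overwrites values on the shared, cached object.
        annot[key] = str(annot[key])
    return annot


# First pass works: annot['A'] is still a dict.
first = add_annots_in_place(cached_annot)

# Second pass receives the same object, whose 'A' value is now a str.
try:
    add_annots_in_place(cached_annot)
    second_error = None
except TypeError as exc:
    # Indexing a str with 'URI' raises TypeError, which `except KeyError` misses.
    second_error = exc
```

After the second call, second_error holds the TypeError, which matches the behaviour described above if the cached object really is shared between calls.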

I've attached a PDF file (annot.pdf) to show the problem. This file only has one line of text (a company's home page URL) which is being seen as an annotation.

This error has been found with:

  • pdfquery version 0.3.0, 0.4.x
  • pdfminer 20140328
  • python 2.7.1
  • Fedora Linux 23

If there's any other information that would help, let me know.

Do you have example code that reproduces this error? pdf.load() is working for me with your supplied file.

I'm assuming you have a little script that passes the file to pdf.load(). Could you attach it? That way, I can run exactly what you ran, with the same file. If it doesn't work for me, that could indicate something else is causing the issue. If it does work, I can trace the difference between what you did and what my code is doing. It's also possible that by removing the private information from the PDF file, I removed whatever was causing the problem. It's been so long since I submitted this that I don't remember. I think I verified that I was still having the issue with the PDF I attached, but I can't remember for sure.

I seem to be getting this error as well. PDF file here. I have also tested on this

  • pdfquery 0.4.3
  • pdfminer 20170720
  • python 3.6
  • OSX 10.12.6

Error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-6d31003dedab> in <module>()
     13     pdf = pdfquery.PDFQuery("../"+name)
     14     pdf.load()
---> 15     tree = pdf.get_tree()
     16     #tree.write("current.xml", pretty_print=True)
     17 

~/anaconda3/lib/python3.6/site-packages/pdfquery/pdfquery.py in get_tree(self, *page_numbers)
    485                 else:
    486                     pages = enumerate(self.get_layouts())
--> 487                 for n, page in pages:
    488                     page = self._xmlize(page)
    489                     page.set('page_index', obj_to_string(n))

~/anaconda3/lib/python3.6/site-packages/pdfquery/pdfquery.py in <genexpr>(.0)
    606     def get_layouts(self):
    607         """ Get list of PDFMiner Layout objects for each page. """
--> 608         return (self.get_layout(page) for page in self._cached_pages())
    609 
    610     def _cached_pages(self, target_page=-1):

~/anaconda3/lib/python3.6/site-packages/pdfquery/pdfquery.py in get_layout(self, page)
    601         self.interpreter.process_page(page)
    602         layout = self.device.get_result()
--> 603         layout = self._add_annots(layout, page.annots)
    604         return layout
    605 

~/anaconda3/lib/python3.6/site-packages/pdfquery/pdfquery.py in _add_annots(self, layout, annots)
    647                     annot = self._set_hwxy_attrs(annot)
    648                 try:
--> 649                     annot['URI'] = resolve1(annot['A'])['URI']
    650                 except KeyError:
    651                     pass

TypeError: string indices must be integers
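Note that the try block at lines 648-651 only catches KeyError, so the TypeError propagates. Below is a rough sketch of one possible mitigation, under the assumption that the root cause is the in-place conversion of a cached object: work on a copy and check the type of the action before indexing it. safe_add_uri() is a hypothetical helper, str() stands in for obj_to_string(), and the real code would still need resolve1() for indirect references.

```python
def safe_add_uri(annot):
    """Return a stringified copy of annot, adding 'URI' when the action is a dict."""
    # Shallow copy: only top-level keys are reassigned, so the caller's
    # (possibly cached) object is left untouched.
    annot = dict(annot)
    action = annot.get("A")
    # Guard against an action that is missing or already stringified.
    if isinstance(action, dict) and "URI" in action:
        annot["URI"] = action["URI"]
    return {key: str(value) for key, value in annot.items()}


# Stand-in for a cached annotation object returned on every lookup.
cached_annot = {
    "Subtype": "/Link",
    "A": {"S": "/URI", "URI": "http://example.com"},
}

first = safe_add_uri(cached_annot)
second = safe_add_uri(cached_annot)  # same result, no TypeError
```

This is not a tested patch against pdfquery itself, just an illustration of the idea.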