Error with annotations
I found an issue when upgrading from pdfquery 0.2.7 to 0.4.3. It looks like support for annotations was added in 0.3.0. Here is what appears to be happening: in the _add_annots() method in pdfquery.py, an annotation object is found by pdfminer. _add_annots() retrieves this object and converts all of its values into strings (via obj_to_string()). When _add_annots() is called again, pdfminer returns a cached version of the annotation object, but by now all of its values have already been converted into strings by pdfquery. This leads to an error on line 649:
annot['URI'] = resolve1(annot['A'])['URI']
The first time through _add_annots(), resolve1(annot['A']) returns a dict with 'URI' as one of its keys. The second time through, annot['A'] is a string representation of that dict (produced by obj_to_string()), so indexing it with 'URI' fails.
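To make the suspected sequence concrete, here is a minimal simulation using plain dicts. It does not call pdfquery or pdfminer: make_cache() and add_annots_sim() are hypothetical stand-ins, str() stands in for obj_to_string(), and the order of the steps inside the real _add_annots() is an assumption.

```python
def make_cache():
    # Stand-in for pdfminer's object cache (an assumption for illustration):
    # every lookup hands back the same dict object.
    return {"annot-1": {"Subtype": "Link",
                        "A": {"S": "URI", "URI": "http://example.com"}}}


def add_annots_sim(cache, key):
    """Simplified stand-in for _add_annots(): mutates the cached dict in place."""
    annot = cache[key]                      # same object on every call
    try:
        annot["URI"] = annot["A"]["URI"]    # works only while annot["A"] is a dict
    except KeyError:
        pass
    for name in list(annot):
        annot[name] = str(annot[name])      # stand-in for obj_to_string()
    return annot


cache = make_cache()
add_annots_sim(cache, "annot-1")            # first pass succeeds
try:
    add_annots_sim(cache, "annot-1")        # second pass sees strings
except TypeError as exc:
    print("second pass failed:", exc)
```

The second call fails because the cached annot["A"] is now a str, and indexing a str with 'URI' raises TypeError rather than KeyError.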
I've attached a PDF file (annot.pdf) to show the problem. This file only has one line of text (a company's home page URL) which is being seen as an annotation.
This error has been found with:
- pdfquery version 0.3.0, 0.4.x
- pdfminer 20140328
- python 2.7.1
- Fedora Linux 23
If there's any other information that would help, let me know.
Do you have example code that reproduces this error? pdf.load() is working for me with your supplied file.
I'm assuming you have a little script that sends the file to pdf.load(). Could you attach that? That way, I can run exactly what you did with the same file. If it doesn't work for me, that could indicate something else is causing the issue. If it does work, I can trace the difference between what you did and what my code is doing. It's also possible that by removing the private information from the PDF file, I also removed what was causing the problem. It's been so long since I submitted this that I don't remember. I think I verified that I was still having the issue with the PDF I attached, but I can't remember for sure.
I seem to be getting this error as well. PDF file here. I have also tested on this
- pdfquery 0.4.3
- pdfminer 20170720
- python 3.6
- OSX 10.12.6
Error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-6d31003dedab> in <module>()
13 pdf = pdfquery.PDFQuery("../"+name)
14 pdf.load()
---> 15 tree = pdf.get_tree()
16 #tree.write("current.xml", pretty_print=True)
17
~/anaconda3/lib/python3.6/site-packages/pdfquery/pdfquery.py in get_tree(self, *page_numbers)
485 else:
486 pages = enumerate(self.get_layouts())
--> 487 for n, page in pages:
488 page = self._xmlize(page)
489 page.set('page_index', obj_to_string(n))
~/anaconda3/lib/python3.6/site-packages/pdfquery/pdfquery.py in <genexpr>(.0)
606 def get_layouts(self):
607 """ Get list of PDFMiner Layout objects for each page. """
--> 608 return (self.get_layout(page) for page in self._cached_pages())
609
610 def _cached_pages(self, target_page=-1):
~/anaconda3/lib/python3.6/site-packages/pdfquery/pdfquery.py in get_layout(self, page)
601 self.interpreter.process_page(page)
602 layout = self.device.get_result()
--> 603 layout = self._add_annots(layout, page.annots)
604 return layout
605
~/anaconda3/lib/python3.6/site-packages/pdfquery/pdfquery.py in _add_annots(self, layout, annots)
647 annot = self._set_hwxy_attrs(annot)
648 try:
--> 649 annot['URI'] = resolve1(annot['A'])['URI']
650 except KeyError:
651 pass
TypeError: string indices must be integers
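One possible way to avoid this failure, sketched with plain dicts rather than the real library, is to convert a copy of the annotation instead of the cached object itself. This is only an illustration of the idea, not the fix adopted by pdfquery: make_cache() and add_annots_copy_sim() are hypothetical names, and str() stands in for obj_to_string().

```python
def make_cache():
    # Stand-in for pdfminer's object cache (an assumption for illustration).
    return {"annot-1": {"Subtype": "Link",
                        "A": {"S": "URI", "URI": "http://example.com"}}}


def add_annots_copy_sim(cache, key):
    """Simplified stand-in for _add_annots() that leaves the cache untouched."""
    annot = dict(cache[key])                # shallow copy of the cached dict
    try:
        annot["URI"] = annot["A"]["URI"]    # annot["A"] is still the original dict
    except KeyError:
        pass
    # Build a new dict of strings; the cached object keeps its original values.
    return {name: str(value) for name, value in annot.items()}


cache = make_cache()
first = add_annots_copy_sim(cache, "annot-1")
second = add_annots_copy_sim(cache, "annot-1")   # no TypeError on the second pass
print(first["URI"], second["URI"])
```

A shallow copy is enough here because only top-level keys are rebound to strings; the nested dict under 'A' is read but never modified.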