jcushman/pdfquery

pdf.pq(:in_bbox) pulling duplicate values

Closed · 1 comment

When I run the code below on multiple PDFs, it pulls duplicate values from each box, seemingly at random. I examined the XML file to make sure there weren't two text boxes layered on top of each other, and found no duplicates on any page.

When I say the duplicates are created randomly, I mean that the number of duplicates, which values are duplicated, and the order in which they are pulled into text are random.

I'm curious whether you've seen this before and if there is a fix. It's possible that the PDFs themselves are the problem. Let me know if access to the XML file might help. I can probably strip the sensitive information and send it.

Any help would be greatly appreciated!

The image below shows an example of the text in the box. I cannot share the whole PDF due to confidentiality.
[image: isopull]

#import the required Python libraries
import xlwt
import pdfquery
import csv
import re

pages = raw_input('Please enter the number of pages in the document:    ')

#convert user input to integer
pages = int(pages)

#Path to the pdf file. PDFQuery is the library that pulls the data out of the pdf.
#A raw string keeps the backslashes in the Windows path from being read as escape sequences.
pdf = pdfquery.PDFQuery(r'D:\New Storage\Coding\Python Projects\Iso Pull\Lack.pdf')

#load every page that the loop below will query
pdf.load(range(0,pages))

#cycle through page numbers
for pagenumber in range(0,pages):

    #create a string sub to avoid messiness in the pdf.pq page number callout
    pagesub = 'LTPage[page_index="%s"]' % pagenumber

    #find text in boxes. Coordinates are in points (inches*72), given as the
    #lower left corner of the box followed by the upper right corner.
    #Also, keep in mind the coordinates of the BOM and Iso number may need tweaking

    Item = pdf.pq(pagesub + ' :in_bbox("947.52,379.44,960.48,750.16")').text()
    QTY = pdf.pq(pagesub + ' :in_bbox("960.48,379.44,987.12,750.16")').text()
    Size = pdf.pq(pagesub + ' :in_bbox("987.12,379.44,1020.24,750.16")').text()
    Sch_Minwall = pdf.pq(pagesub + ' :in_bbox("1020.24,379.44,1059.12,750.16")').text()
    Description2 = pdf.pq(pagesub + ' :in_bbox("1059.12,379.44,1203.84,750.16")').text()
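
The five selectors above share one vertical range and adjacent column edges, so they could be generated from a single list of x-coordinates instead of being typed out by hand. This is only a rough sketch: `COLUMN_EDGES`, `COLUMN_NAMES`, and `column_selectors` are hypothetical names introduced here, and the function just builds the selector strings without calling pdfquery.

```python
# Hypothetical helper: build one :in_bbox selector per table column.
# The edge and y-range values are copied from the selectors in the script above.
COLUMN_EDGES = [947.52, 960.48, 987.12, 1020.24, 1059.12, 1203.84]
COLUMN_NAMES = ["Item", "QTY", "Size", "Sch_Minwall", "Description2"]
Y_BOTTOM, Y_TOP = 379.44, 750.16


def column_selectors(page_index):
    """Return {column name: selector string} for a single page."""
    prefix = 'LTPage[page_index="%s"]' % page_index
    selectors = {}
    # Pair each column name with its left and right edge.
    for name, x_left, x_right in zip(COLUMN_NAMES, COLUMN_EDGES, COLUMN_EDGES[1:]):
        selectors[name] = '%s :in_bbox("%s,%s,%s,%s")' % (
            prefix, x_left, Y_BOTTOM, x_right, Y_TOP)
    return selectors
```

Each value could then be passed to `pdf.pq(...)` in place of the hand-written strings, which keeps neighbouring boxes from drifting apart when the coordinates need tweaking.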

Hi! Sorry I missed this. I suspect you're matching multiple nested elements here -- like <foo><bar>text</bar></foo> returns the same text twice for both <foo> and <bar>. You could avoid that by selecting for bar:in_bbox.
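
To illustrate the nesting effect without a PDF, here is a minimal sketch using only the standard library. The XML is a made-up stand-in for the tree pdfquery builds (the tag names follow pdfminer's layout classes, the bbox values are invented), and `inside` and `text_in_box` are hypothetical helpers that only imitate how a bounding-box match followed by `.text()` behaves.

```python
import xml.etree.ElementTree as ET

# Made-up page: each text box wraps a text line with the same bounding box.
PAGE_XML = """
<LTPage page_index="0">
  <LTTextBoxHorizontal bbox="[950, 400, 958, 410]">
    <LTTextLineHorizontal bbox="[950, 400, 958, 410]">1</LTTextLineHorizontal>
  </LTTextBoxHorizontal>
  <LTTextBoxHorizontal bbox="[950, 420, 958, 430]">
    <LTTextLineHorizontal bbox="[950, 420, 958, 430]">2</LTTextLineHorizontal>
  </LTTextBoxHorizontal>
</LTPage>
""".strip()

BOX = (947.52, 379.44, 960.48, 750.16)  # the "Item" column from the issue


def inside(el, box):
    """True if the element has a bbox that lies entirely within box."""
    raw = el.get("bbox")
    if raw is None:
        return False
    x0, y0, x1, y1 = (float(v) for v in raw.strip("[]").split(","))
    return x0 >= box[0] and y0 >= box[1] and x1 <= box[2] and y1 <= box[3]


def text_in_box(root, box, tag=None):
    """Join the text (descendants included) of every element inside box.

    tag=None considers every element; a tag name restricts the match.
    """
    matches = [el for el in root.iter(tag) if inside(el, box)]
    return " ".join("".join(el.itertext()).strip() for el in matches)


root = ET.fromstring(PAGE_XML)
print(text_in_box(root, BOX))                          # parent and child both match
print(text_in_box(root, BOX, "LTTextLineHorizontal"))  # only the innermost lines match
```

In the original script, the equivalent change would be to put a tag name in front of the pseudo-class, for example `pagesub + ' LTTextLineHorizontal:in_bbox("947.52,379.44,960.48,750.16")'`, assuming `LTTextLineHorizontal` is the element that holds the text in this PDF's XML.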