jcushman/pdfquery

PyQuery objects returned by items() have problems

Closed this issue · 2 comments

ezk84 commented

Given a = next(pdf.pq('LTTextLineHorizontal').items()):

  1. a.find(':in_bbox("x0,y0,x1,y1")') raises an ExpressionError: The pseudo-class :in_bbox() is unknown
  2. a.parent('LTPage') returns an empty list, even though a.parents().filter(lambda i, el: el.tag == 'LTPage') returns the expected parent (assuming the LTPage is the direct parent of the element matched by a).

Both calls succeed when a does not come from the items() iterator, for example with a = pdf.pq('LTTextLineHorizontal[index="13"]').

I'm having issues along similar lines. The pyquery interface seems to behave erratically at times. I'll try to put together a reproducible error.

Both of these currently work for me:

In [54]: next(pdf.pq('LTTextLineHorizontal').items()).find(':in_bbox("0,0,10000,10000")')
Out[54]: [<LTTextBoxHorizontal>]

In [61]: next(pdf.pq('LTTextLineHorizontal').items()).parent()
Out[61]: [<LTRect>]

In [62]: next(pdf.pq('LTTextLineHorizontal').items()).parent('LTRect')
Out[62]: [<LTRect>]

Feel free to reopen if you can reproduce your error.
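
For anyone trying to reproduce this, here is a minimal sketch of a helper that runs the calls from the report on a wrapper obtained through items() and on one obtained by calling pq(...) directly, so the two can be compared side by side. The helper name compare_wrappers, the default bounding box, and the PDF path in the usage comment are hypothetical; it assumes only the PyQuery methods already used in this thread, plus eq().

```python
def compare_wrappers(pq, selector, bbox="0,0,10000,10000"):
    """Probe a wrapper from items() and a directly selected one.

    `pq` is assumed to be a callable PyQuery object such as `pdf.pq`.
    Returns a dict with the outcome of each call for both wrappers.
    """
    def probe(wrapper):
        results = {}
        try:
            # Custom pseudo-class from the report; count the matches.
            results["in_bbox"] = len(wrapper.find(':in_bbox("%s")' % bbox))
        except Exception as exc:  # e.g. ExpressionError, as in the report
            results["in_bbox"] = "%s: %s" % (type(exc).__name__, exc)
        # Tag names of the direct parent(s), without a selector filter.
        results["parent"] = [el.tag for el in wrapper.parent()]
        return results

    from_items = next(pq(selector).items())  # wrapper created by items()
    direct = pq(selector).eq(0)              # same element, selected directly
    return {"items": probe(from_items), "direct": probe(direct)}


# Usage (assumes pdfquery is installed; the path is a placeholder):
#   import pdfquery
#   pdf = pdfquery.PDFQuery("example.pdf")
#   pdf.load()
#   print(compare_wrappers(pdf.pq, "LTTextLineHorizontal"))
```

If the "items" and "direct" entries differ, posting that output together with the pdfquery and pyquery versions should be enough to reopen.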