camelot-dev/camelot

Difficulties with Multi-line headers. Rows shifted down.

poetaster opened this issue · 5 comments

Describe the bug
The PDF at https://poetaster.de/misc/118.pdf (not uploaded here since it may be a copyright issue) is read well overall, but camelot shifts the rows under the multi-line "controllability" header down.

Steps to reproduce the bug

Load the above file and try both stream and lattice reading. I tried a lot of variations:

stream with different row tolerances:
dfs = camelot.read_pdf('118.pdf', flavor='stream', row_tol=20, flag_size=True)

and lattice with many line_scale and shift_text variations:

dfs = camelot.read_pdf('118.pdf', flavor='lattice', shift_text=['r','t', 'r', 't'], line_scale=20)

[Screenshot: dataframe]

Lattice appears to get it right:

camelot.plot(dfs[0], kind='grid').show()

[Screenshot: lattice]

The detected grid seems correct, but the extracted rows in the controllability part are always shifted down.

Expected behavior

Rows should not be shifted.
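
As a stopgap until the extraction itself behaves, the shifted cells could probably be moved back up after the fact. A minimal sketch, assuming the affected columns are known by index and are all offset by the same number of rows; `shift_columns_up` is a hypothetical helper, not part of camelot, and the sample data is made up:

```python
def shift_columns_up(rows, columns, offset=1, fill=""):
    """Return a copy of rows with the given columns moved up by offset rows.

    rows    -- list of equal-length lists holding the table's cell values
    columns -- indices of the columns that were shifted down
    offset  -- number of rows they were shifted by (assumed constant)
    fill    -- value for the cells vacated at the bottom
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    fixed = [list(row) for row in rows]  # copy so the input stays untouched
    n = len(fixed)
    for col in columns:
        for i in range(n):
            src = i + offset
            # Pull the value from `offset` rows below, or pad past the end.
            fixed[i][col] = rows[src][col] if src < n else fill
    return fixed


# Made-up example: columns 1 and 2 sit one row too low.
shifted = [
    ["a", "", ""],
    ["b", "1", "x"],
    ["c", "2", "y"],
    ["", "3", "z"],
]
repaired = shift_columns_up(shifted, columns=[1, 2])
```

With camelot the input rows could come from `dfs[0].df.values.tolist()`. The trailing row ends up empty and would still need to be dropped by hand, and the column indices and offset have to be read off the extracted table.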

Code

Began with:

import camelot
dfs = camelot.read_pdf('118.pdf') 

And tried many variations, the most recent lattice one being:
dfs = camelot.read_pdf('118.pdf', flavor='lattice', shift_text=['r','t', 'r', 't'], line_scale=20)

PDF
See above.

Screenshots
See above.

Environment

  • OS: ubuntu 22
  • Python version: 3.10.12
  • Numpy version: 1.24.0
  • OpenCV version: 4.8.1.78
  • Ghostscript version: 0.7
  • Camelot version: 0.11.0

Additional context