Difficulties with Multi-line headers. Rows shifted down.
poetaster opened this issue · 5 comments
Describe the bug
This PDF, https://poetaster.de/misc/118.pdf (which I'm not uploading here since it may be a copyright issue), is read well overall, but camelot shifts the rows under the multi-line "controllability" header down.
Steps to reproduce the bug
Load the above file and try both stream and lattice reading. I tried a lot of variations:
stream with different row tolerances:
dfs = camelot.read_pdf('118.pdf', flavor='stream', row_tol=20, flag_size=True)
and lattice with many scale and shift variations.
dfs = camelot.read_pdf('118.pdf', flavor='lattice', shift_text=['r','t', 'r', 't'], line_scale=20)
Lattice appears to get it right:
camelot.plot(dfs[0], kind='grid').show()
The grid looks correct, but the extracted rows in the controllability part are still always shifted down.
Expected behavior
Rows should not be shifted.
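One way to make the shift measurable is to count, per column, how many cells are empty before the first filled cell below the header. The following is only a minimal sketch: the helper name `leading_blank_counts`, the `header_rows` parameter, and the sample rows are hypothetical and not taken from 118.pdf or from camelot.

```python
def leading_blank_counts(rows, header_rows=0):
    """Per column, count the empty cells before the first non-empty
    cell, looking only at the rows below the header rows."""
    body = rows[header_rows:]
    if not body:
        return []
    n_cols = max(len(row) for row in body)
    counts = []
    for col in range(n_cols):
        count = 0
        for row in body:
            # Treat missing cells in short rows as empty.
            cell = row[col] if col < len(row) else ""
            if str(cell).strip():
                break
            count += 1
        counts.append(count)
    return counts


# Hypothetical sample (not real data): the two columns under the
# multi-line header start one row later than the first column.
sample = [
    ["Name", "Controllability", ""],
    ["", "A", "B"],
    ["x1", "", ""],
    ["x2", "1", "2"],
    ["x3", "3", "4"],
]
print(leading_blank_counts(sample, header_rows=2))  # [0, 1, 1]
```

With camelot, the rows could be taken from an extracted table via `dfs[0].df.values.tolist()`. If the columns under the multi-line header report a higher count than the others, that matches the shift described here. Columns that are legitimately empty at the top would report the same, so this is a hint rather than proof.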
Code
Began with:
import camelot
dfs = camelot.read_pdf('118.pdf')
And tried many variations, the most recent lattice one being:
dfs = camelot.read_pdf('118.pdf', flavor='lattice', shift_text=['r','t', 'r', 't'], line_scale=20)
PDF
See above.
Screenshots
See above.
Environment
- OS: Ubuntu 22
- Python version: 3.10.12
- Numpy version: 1.24.0
- OpenCV version: 4.8.1.78
- Ghostscript version: 0.7
- Camelot version: 0.11.0
Additional context