Delete not defined
The function delete in core.py is not defined.
I get the following error whenever this code is executed.
  File "test_to_pandas.py", line 6, in <module>
    cells = [pdf.process_page("pg_0001.pdf",p) for p in pages]
  File "/usr/local/lib/python2.7/dist-packages/pdf_table_extract-0.1-py2.7.egg!/pdftableextract/core.py", line 179, in process_page
    vd = delete(vd,i)
NameError: global name 'delete' is not defined
This code section comes from core.py
    j = 0
    while j < len(hd):
        if hd[j+1]-hd[j] > maxdiv :
            hd = delete(hd,j)
            hd = delete(hd,j)
        else:
            j=j+2
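For anyone hitting the same NameError: the call shape `delete(hd, j)` matches `numpy.delete(arr, obj)`, so my guess is that the name is simply missing from a numpy import in core.py. That is an assumption, since the import lines are not shown in this thread. Here is a minimal sketch of the loop with the name defined. The wrapper function `drop_wide_pairs` and the sample values are made up for illustration and are not part of the package.

```python
# Assumption: core.py intends numpy.delete. The fallback below is a
# hypothetical stand-in with the same effect for a single integer index,
# so the sketch also runs where numpy is not installed.
try:
    from numpy import delete
except ImportError:
    def delete(seq, index):
        return [v for k, v in enumerate(seq) if k != index]


def drop_wide_pairs(hd, maxdiv):
    """Hypothetical wrapper around the loop quoted from core.py.

    Treats hd as consecutive (start, end) pairs and removes every pair
    whose width exceeds maxdiv. Assumes len(hd) is even.
    """
    j = 0
    while j < len(hd):
        if hd[j + 1] - hd[j] > maxdiv:
            # Deleting index j twice removes both members of the pair,
            # because the second member shifts down into position j.
            hd = delete(hd, j)
            hd = delete(hd, j)
        else:
            j = j + 2
    return hd


# Illustrative values only: three pairs, the middle one is too wide.
hd = [10, 12, 40, 90, 100, 103]
result = [int(v) for v in drop_wide_pairs(hd, maxdiv=5)]
print(result)
```

With the sample input, the pair (40, 90) is dropped and the other two pairs are kept.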
I have attached an image of the PDF I was trying to parse.
[Attached image: screenshot from 2013-10-09 17 55 27]
I found the PDF you are using. That change fixes the bug, and 6118079 fixes another bug in the command line version of the code. But neither solves your problem, partly because of the header and surrounding text, but also because the lines in the table are too thin and the text too fuzzy to parse, I think.
Cropping will remove most of the text (-c 0.25:1.0:-0.35:7.5 works for me), but the lines are not showing up clearly to the parser for some reason, even though -checklines shows some valid lines:
pdf-table-extract -i a.pdf -p 3 -checklines -o a.pnm -colmult 10.0 -c0.25:1.0:-0.35:-7.0 -r200 ; convert a.pnm a.png ; open a.png
The dividers and cells don't show clean separation. After some tuning I get
pdf-table-extract -i a.pdf -p 3 -o a.pnm -colmult 10.0 -c0.25:1.0:-0.35:-7.0 -r300 --line_length 0.12 -ttable_csv; convert a.pnm a.png ; open a.png
which produces the image below, but still doesn't return data in the cells.
Thanks, I will continue to experiment to see if I can get these PDFs to work.