ashima/pdf-table-extract

Delete not defined

Closed this issue · 3 comments

The function delete in core.py is not defined.

I get the following error whenever this code is executed.
File "test_to_pandas.py", line 6, in <module>
cells = [pdf.process_page("pg_0001.pdf",p) for p in pages]
File "/usr/local/lib/python2.7/dist-packages/pdf_table_extract-0.1-py2.7.egg
!/pdftableextract/core.py", line 179, in process_page
vd = delete(vd,i)
NameError: global name 'delete' is not defined

This code section comes from core.py

  j = 0
  while j < len(hd):
    if hd[j+1]-hd[j] > maxdiv :
      hd = delete(hd,j)
      hd = delete(hd,j)
    else:
      j=j+2

I have attached an image of the PDF I was trying to parse.
screenshot from 2013-10-09 17 55 27

This is caused by not importing the delete function from numpy (see #1 for the related bug). In an older version of the code, numpy was imported as

from numpy import *

It is no longer imported this way, so this bug appears. I've tried to fix it in 9b02062, but I don't have the PDF to test against.
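
For anyone hitting the same NameError, here is a minimal sketch of the kind of fix involved: importing delete explicitly from numpy. It is not necessarily what 9b02062 contains. The function name drop_wide_pairs, the sample values, and the len(hd) - 1 bound are assumptions added for illustration. The pairing of entries is inferred from the j=j+2 stepping in the quoted loop.

```python
import numpy as np
from numpy import delete  # explicit import instead of `from numpy import *`


def drop_wide_pairs(hd, maxdiv):
    """Drop consecutive pairs of entries whose gap exceeds maxdiv.

    Hypothetical standalone version of the loop quoted from core.py.
    """
    hd = np.asarray(hd)
    j = 0
    # Assumption: stop at len(hd) - 1 so hd[j + 1] is always a valid index.
    while j < len(hd) - 1:
        if hd[j + 1] - hd[j] > maxdiv:
            # numpy.delete returns a new array, so deleting index j twice
            # removes both members of the pair.
            hd = delete(hd, j)
            hd = delete(hd, j)
        else:
            j += 2
    return hd


# Made-up sample data: the pair (10, 50) is wider than maxdiv and is dropped.
result = drop_wide_pairs([0, 2, 10, 50, 60, 63], maxdiv=5)
print(result)
```

Without the import line, the first call to delete raises the NameError shown in the traceback, because the name is no longer pulled in by a wildcard import.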

I found the PDF you are using. That change fixes the bug, and 6118079 fixes another bug in the command line version of the code. But neither solves your problem, partly because of the header and surrounding text, but also, I think, because the lines in the table are too thin and the text too fuzzy to parse.

screen shot 2013-10-09 at 3 34 53 pm

Cropping will remove most of the text (-c 0.25:1.0:-0.35:7.5 works for me), but for some reason the lines are not showing up clearly to the parser, even though -checklines shows some valid lines:

pdf-table-extract -i a.pdf -p 3 -checklines -o a.pnm -colmult 10.0 -c0.25:1.0:-0.35:-7.0 -r200 ; convert a.pnm a.png ; open a.png

screen shot 2013-10-09 at 3 51 50 pm

The dividers and cells don't show clean separation. After some tuning I get

pdf-table-extract -i a.pdf -p 3 -o a.pnm -colmult 10.0 -c0.25:1.0:-0.35:-7.0 -r300 --line_length 0.12 -ttable_csv; convert a.pnm a.png ; open a.png

which produces the image below, but still doesn't return data in the cells.

screen shot 2013-10-09 at 3 58 22 pm

Thanks, I will continue to experiment to see if I can get these PDFs to work.