ashima/pdf-table-extract

Consider using pdfminer as library / alternative to pdftoppm

Closed · 1 comment

Consider using pdfminer as library / alternative to pdftoppm.

pdfminer is a pure python implementation.

See: https://github.com/euske/pdfminer/

pdfminer would not be 'an alternative' to pdftoppm. It might be an alternative to pdftotext.
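To make the comparison concrete, here is a rough sketch of the two text-extraction routes side by side. It is not code from pdf-table-extract: the function names are made up for illustration, the pdfminer path assumes the maintained pdfminer.six fork (which provides `pdfminer.high_level.extract_text`), and the Poppler path assumes the `pdftotext` binary is on the PATH.

```python
import subprocess


def pdftotext_command(pdf_path, page):
    """Build the argv for Poppler's pdftotext, limited to one page.

    "-" as the output file asks pdftotext to write to stdout.
    """
    return ["pdftotext", "-f", str(page), "-l", str(page),
            "-layout", pdf_path, "-"]


def text_with_pdftotext(pdf_path, page):
    """Extract one page of text by shelling out to pdftotext."""
    raw = subprocess.check_output(pdftotext_command(pdf_path, page))
    return raw.decode("utf-8", errors="replace")


def text_with_pdfminer(pdf_path, page):
    """Extract one page of text in pure Python (assumes pdfminer.six)."""
    # Imported lazily so the Poppler path works without pdfminer.six.
    from pdfminer.high_level import extract_text

    # pdfminer.six numbers pages from 0; pdftotext numbers them from 1.
    return extract_text(pdf_path, page_numbers=[page - 1])
```

Either function only replaces the text step. Neither one produces the page bitmap that pdftoppm supplies for finding the table lines.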

If you want to use pdfminer, there are some extremely hacky methods to extract lines from the PDF (https://blog.scraperwiki.com/2012/06/pdf-table-extraction-of-a-table/ , https://github.com/obeattie/pdfminer/wiki/How-to-extract-data-from-tables), but none of them has shown the success of the current method using the Poppler library. For example, pdfminer would not work on scanned, non-OCRed PDFs, whereas the pdf-table-extract package could be extended to pass each cell into an OCR program instead of pdftotext, or to extract a table from an image by changing the pdftoppm call to a pngtoppm or jpegtoppm (I have no idea if those exist, but ImageMagick will do both conversions).
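As a rough illustration of swapping the rasterizing step, the sketch below picks a command based on the input file's extension. The helper names are hypothetical and not part of pdf-table-extract. It assumes Poppler's `pdftoppm` and ImageMagick's `convert` are on the PATH (ImageMagick 7 installs may need `magick` instead), and that both write a grayscale PGM image to stdout when called this way.

```python
import os
import subprocess


def rasterize_command(path, page=1, dpi=300):
    """Return an argv that writes a grayscale bitmap of `path` to stdout."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        # With no output root given, pdftoppm writes the page to stdout.
        return ["pdftoppm", "-gray", "-r", str(dpi),
                "-f", str(page), "-l", str(page), path]
    if ext in (".png", ".jpg", ".jpeg"):
        # "pgm:-" asks ImageMagick for PGM output on stdout.
        return ["convert", path, "-colorspace", "Gray", "pgm:-"]
    raise ValueError("unsupported input type: %r" % ext)


def rasterize(path, page=1, dpi=300):
    """Run the chosen converter and return the raw bitmap bytes."""
    return subprocess.check_output(rasterize_command(path, page, dpi))
```

The point of the split is that the table-detection code would only ever see bitmap bytes, so a scanned image and a PDF page could go through the same path. The text step would still need an OCR program for scanned input.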

If you really wanted to parse PDFs using pdfminer, you would be much more successful using more appropriate table detection heuristics than are used in pdf-table-extract (http://people.cs.umass.edu/~mccallum/papers/TableExtraction-irj06.pdf , http://www.amazon.com/Intelligent-Scientific-Information-Computational-Intelligence/dp/364224808X/ref=sr_1_1?s=books&ie=UTF8&qid=1385592535&sr=1-1&keywords=364224808X). But for the tables we wanted to extract, pdf-table-extract works perfectly once tuned.