Consider using pdfminer as library / alternative to pdftoppm
Closed this issue · 1 comments
Consider using pdfminer as library / alternative to pdftoppm.
pdfminer is a pure Python implementation.
pdfminer would not be 'an alternative' to pdftoppm. It might be an alternative to pdftotext.
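To make the comparison concrete, here is a minimal sketch of the kind of per-cell text call that pdfminer would have to replace. It assumes the text of one table cell is read by cropping a page region with Poppler's pdftotext; the helper name `pdftotext_cell_command` and its parameters are hypothetical and not taken from pdf-table-extract.

```python
from typing import List


def pdftotext_cell_command(pdf_path: str, page: int, x: int, y: int,
                           width: int, height: int, dpi: int = 300) -> List[str]:
    """Build a pdftotext command that prints the text inside one crop box.

    Hypothetical helper: the crop box is given in pixels at the chosen
    resolution, matching pdftotext's -x/-y/-W/-H options.
    """
    return [
        "pdftotext",
        "-r", str(dpi),                      # resolution the crop box refers to
        "-x", str(x), "-y", str(y),          # top-left corner of the crop box
        "-W", str(width), "-H", str(height), # size of the crop box
        "-layout", "-nopgbrk",               # keep layout, no page-break chars
        "-f", str(page), "-l", str(page),    # restrict to a single page
        pdf_path,
        "-",                                 # write the text to stdout
    ]
```

The returned list can be handed to `subprocess.run(cmd, stdout=subprocess.PIPE, check=True)`. Replacing this one step with pdfminer (or an OCR program) would leave the bitmap-based cell detection untouched.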
If you want to use pdfminer, there are some extremely hacky methods to extract lines from the PDF (https://blog.scraperwiki.com/2012/06/pdf-table-extraction-of-a-table/ , https://github.com/obeattie/pdfminer/wiki/How-to-extract-data-from-tables), but nothing that shows the success of the current method using the Poppler library. For example, pdfminer would not work on scanned, non-OCRed PDFs, whereas the pdf-table-extract package could be extended to pass each cell to an OCR program instead of pdftotext, or to extract a table from an image by changing the pdftoppm call to a pngtoppm or jpegtoppm call (I have no idea if those exist, but ImageMagick will do both conversions).
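The swap described above can be sketched as a small command builder. This is an illustration under assumptions, not code from pdf-table-extract: the function name `bitmap_command` is hypothetical, and it assumes the legacy ImageMagick `convert` command is on the PATH (newer installs may expose it as `magick`).

```python
import os
from typing import List

# File extensions treated as plain images (an assumption for this sketch).
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def bitmap_command(path: str, page: int = 1, dpi: int = 300) -> List[str]:
    """Return a command that writes a grayscale PGM bitmap to stdout.

    PDFs go through Poppler's pdftoppm; PNG/JPEG files go through
    ImageMagick's convert, so the rest of the pipeline sees the same format.
    """
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        # With no output root given, pdftoppm writes the page to stdout.
        return ["pdftoppm", "-gray", "-r", str(dpi),
                "-f", str(page), "-l", str(page), path]
    if ext in IMAGE_EXTENSIONS:
        # "pgm:-" asks ImageMagick for PGM output on stdout.
        return ["convert", path, "-colorspace", "Gray", "pgm:-"]
    raise ValueError("unsupported input type: %r" % ext)
```

Either command can be run with `subprocess.run(cmd, stdout=subprocess.PIPE, check=True)`, and the resulting bytes fed to whatever table detection already consumes the pdftoppm output.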
If you really wanted to parse PDFs using pdfminer, you would be much more successful using more appropriate table detection heuristics than those used in pdf-table-extract (http://people.cs.umass.edu/~mccallum/papers/TableExtraction-irj06.pdf , http://www.amazon.com/Intelligent-Scientific-Information-Computational-Intelligence/dp/364224808X/ref=sr_1_1?s=books&ie=UTF8&qid=1385592535&sr=1-1&keywords=364224808X), but for the tables we wanted to extract, pdf-table-extract works perfectly once tuned.