A tool for extracting tables, figures, maps, and pictures from PDFs using Tesseract
Script for prepping a PDF for table extraction. Converts each page of the PDF to a PNG with Ghostscript, then runs the PNGs through Tesseract. Also runs each page through annotate.py to assist in debugging. Assumes local installations of Ghostscript and tesseract-ocr.
./preprocess.sh ./my_document_processed my_document.pdf
This creates the file structure necessary for extraction:
document_name
    annotated (PNGs of what Tesseract sees)
    png (each page of the PDF as a PNG image)
    tables (extractions)
    tesseract (HTML for each page produced by Tesseract)
    orig.pdf (the original document)
    text.txt (the extracted text layer)
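The preprocessing steps described above could be sketched in Python roughly as follows. This is an illustration, not the contents of preprocess.sh: the function names, the `page_%d.png` file naming, and the 300 dpi resolution are assumptions, and only the directory names are taken from the layout above. The commands are built as argument lists but not run.

```python
from pathlib import Path

# Subdirectory names taken from the layout described above.
SUBDIRS = ("annotated", "png", "tables", "tesseract")


def make_layout(out_dir):
    """Create the output directory and its four subdirectories."""
    root = Path(out_dir)
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def ghostscript_command(root, dpi=300):
    """Build a Ghostscript command that renders each page of orig.pdf to a PNG.

    The page_%d.png naming and the default dpi are assumptions.
    """
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={root / 'png' / 'page_%d.png'}",
        str(root / "orig.pdf"),
    ]


def tesseract_command(root, png_path):
    """Build a Tesseract command that writes hOCR output for one page image."""
    out_base = root / "tesseract" / Path(png_path).stem
    return ["tesseract", str(png_path), str(out_base), "hocr"]
```

Once orig.pdf has been copied into the output directory, each list could be passed to `subprocess.run(cmd, check=True)`. Keeping command construction separate from execution makes the pipeline easy to inspect when a page fails to convert.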
Script for processing the output of pdf2hocr.
python do_extract.py ~/Documents/doc
Creates a PNG that shows the areas of a page identified by Tesseract. Useful for debugging.
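As a rough idea of what such an annotation step works from: the hOCR output of Tesseract stores each recognized area's bounding box in a `title` attribute of the form `bbox x0 y0 x1 y1`. The sketch below pulls those boxes out with a regular expression. It is not the code in annotate.py; the function name and the sample markup are made up for illustration.

```python
import re

# hOCR elements carry their bounding box as "bbox x0 y0 x1 y1" in the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def extract_bboxes(hocr_text):
    """Return an (x0, y0, x1, y1) tuple for every bbox found in hOCR markup."""
    return [tuple(int(v) for v in m.groups()) for m in BBOX_RE.finditer(hocr_text)]


# Hypothetical hOCR fragment: one content area containing one word.
sample = (
    "<div class='ocr_carea' title=\"bbox 36 92 580 122\">"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 93'>Table</span>"
    "</div>"
)

print(extract_bboxes(sample))  # [(36, 92, 580, 122), (36, 92, 96, 116)]
```

Each tuple could then be drawn onto the page PNG as a rectangle, for example with Pillow's `ImageDraw.rectangle`, to produce an image like those in the annotated directory.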
Various functions for processing tables.
Entry script to table extraction.
Processes a document for tables. Pass it the path to a document that has been pre-processed with pdf2hocr.
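Before running extraction, it may help to confirm that the path really points at a pre-processed document. The check below is a minimal sketch under the assumption that a pre-processed document has exactly the layout listed earlier. `missing_entries` is a hypothetical helper, not part of this tool.

```python
from pathlib import Path

# Expected entries, taken from the file structure listed earlier.
EXPECTED_DIRS = ("annotated", "png", "tables", "tesseract")
EXPECTED_FILES = ("orig.pdf", "text.txt")


def missing_entries(doc_path):
    """Return the names of expected directories and files absent from doc_path."""
    root = Path(doc_path)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

An empty list means the directory looks ready for extraction. Anything else names what the preprocessing step did not produce.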