
Locate and extract table and figure elements and caption references

Primary language: Python


A tool for extracting tables, figures, maps, and pictures from PDFs using Tesseract


preprocess.sh is a script for prepping a PDF for table extraction. It converts each page of the PDF to a PNG with Ghostscript, then runs the PNGs through Tesseract. It also runs each page through annotate.py to assist in debugging. Assumes a local installation of tesseract-ocr.
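The two conversion steps described above can be sketched in Python as follows. This is a rough illustration and not the contents of preprocess.sh: the function names, the `page_%d.png` naming scheme, and the 300 dpi resolution are assumptions made for the example. The Ghostscript and Tesseract options used (`-sDEVICE=png16m`, `-r`, `-sOutputFile`, and the `hocr` config) are standard ones, but the real script may pass different flags.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, png_dir, dpi=300):
    """Build a Ghostscript call that renders every PDF page to its own PNG.

    The page_%d.png pattern and the default dpi are assumptions for this sketch.
    """
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",
        f"-r{dpi}",
        f"-sOutputFile={Path(png_dir) / 'page_%d.png'}",
        str(pdf_path),
    ]


def tesseract_command(png_path, out_dir):
    """Build a Tesseract call that writes hOCR output for one page image."""
    out_base = Path(out_dir) / Path(png_path).stem
    return ["tesseract", str(png_path), str(out_base), "hocr"]


def preprocess(pdf_path, out_dir):
    """Render the pages, then OCR each one (hypothetical helper)."""
    out = Path(out_dir)
    png_dir = out / "png"
    ocr_dir = out / "tesseract"
    png_dir.mkdir(parents=True, exist_ok=True)
    ocr_dir.mkdir(parents=True, exist_ok=True)

    # check=True raises CalledProcessError if either tool fails.
    subprocess.run(ghostscript_command(pdf_path, png_dir), check=True)
    for png in sorted(png_dir.glob("page_*.png")):
        subprocess.run(tesseract_command(png, ocr_dir), check=True)
```

Building the commands as lists, instead of shell strings, avoids quoting problems when a path contains spaces.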

Example usage

./preprocess.sh ./my_document_processed my_document.pdf

This creates the file structure necessary for extraction:

  annotated (PNGs of what Tesseract sees)
  png (each page of the PDF as a PNG image)
  tables (extractions)
  tesseract (HTML for each page produced by Tesseract)
  orig.pdf (the original document)
  text.txt (the extracted text layer)
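Before running extraction, it can help to confirm that this layout exists. The check below is a minimal sketch, assuming the four directories and two files listed above are all required; `missing_outputs` is a hypothetical helper and not part of this repository.

```python
from pathlib import Path

# Names taken from the file structure listed above.
EXPECTED_DIRS = ("annotated", "png", "tables", "tesseract")
EXPECTED_FILES = ("orig.pdf", "text.txt")


def missing_outputs(doc_dir):
    """Return the expected entries that are absent from a processed folder."""
    root = Path(doc_dir)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing
```

An empty list means the folder looks complete; anything else names what the preprocessing step failed to produce.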


Script for processing the output of pdf2hocr.

Example usage

python do_extract.py ~/Documents/doc


Creates a PNG that shows the areas of a page identified by Tesseract. Useful for debugging.
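The areas Tesseract identifies are recorded in its hOCR output as `bbox x0 y0 x1 y1` entries inside each element's `title` attribute. The sketch below collects those boxes using only the standard library. `HocrBoxParser` and `extract_boxes` are hypothetical names for illustration, not functions from this repository, and drawing the boxes onto the page image is left out.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates as "bbox x0 y0 x1 y1" in the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxParser(HTMLParser):
    """Collect (class, x0, y0, x1, y1) for every element that has a bbox."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        match = BBOX_RE.search(attributes.get("title") or "")
        if match:
            coords = tuple(int(value) for value in match.groups())
            self.boxes.append((attributes.get("class") or "",) + coords)


def extract_boxes(hocr_text):
    """Parse hOCR markup and return the bounding boxes in document order."""
    parser = HocrBoxParser()
    parser.feed(hocr_text)
    return parser.boxes
```

Each tuple can then be passed to an image library to draw a rectangle, with the element class (for example `ocr_carea` or `ocrx_word`) deciding the outline colour.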


Various functions for processing tables.


Entry script to table extraction.


Processes a document for tables. Pass it the path to a document that has been pre-processed with pdf2hocr.