/PDFLayoutTextStripper

Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).

Primary LanguageJava

PDFLayoutTextStripper

Converts a PDF file into a text file while keeping the layout of the original PDF. Useful to extract the content from a table or a form in a PDF file. PDFLayoutTextStripper is a subclass of PDFTextStripper class (from the Apache PDFBox library).

  • Use cases
  • How to install
  • How to use

Use cases

Data extraction from a table in a PDF file example

Data extraction from a form in a PDF file example

How to install

  1. Install apache pdfbox manually (to get the v2.0.6 click here ) and its two dependencies commons-logging.jar and fontbox

warning: only pdfbox versions from version 2.0.0 upwards are compatible with this version of PDFLayoutTextStripper.java

How to use on Linux

cd PDFLayoutTextStripper
javac -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar *.java
java -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar test

How to use on Windows

The same as for Linux (see above) but replace : with ;