tabula-java

tabula-java is a library for extracting tables from PDF files — it is the table extraction engine that used to power Tabula (repo). You can use tabula-java as a command-line tool to programmatically extract tables from PDFs.

(This is the new version of the extraction engine; the previous code can be found at tabula-extractor.)

Download

Download a version of the tabula-java's jar, with all dependencies included, that works on Mac, Windows and Linux from our releases page.

Usage Examples

tabula-java provides a command line application:

$ java -jar ./target/tabula-0.9.1-jar-with-dependencies.jar --help

usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f <FORMAT>] [-g] [-h] [-i]
       [-n] [-o <OUTFILE>] [-p <PAGES>] [-r] [-s <PASSWORD>] [-u] [-v]

Tabula helps you extract tables from PDFs
 -a,--area <AREA>           Portion of the page to analyze
                            (top,left,bottom,right). Example: --area
                            269.875,12.75,790.5,561. Default is entire
                            page
 -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3
 -d,--debug                 Print detected table areas instead of
                            processing.
 -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory

 -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
 -g,--guess                 Guess the portion of the page to analyze per
                            page.
 -h,--help                  Print this help text.
 -i,--silent                Suppress all stderr output.
 -n,--no-spreadsheet        Force PDF not to be extracted using
                            spreadsheet-style extraction (if there are
                            ruling lines separating each cell, as in a PDF
                            of an Excel spreadsheet)
 -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                            Default: -
 -p,--pages <PAGES>         Comma separated list of ranges, or all.
                            Examples: --pages 1-3,5-7, --pages 3 or
                            --pages all. Default is --pages 1
 -r,--spreadsheet           Force PDF to be extracted using
                            spreadsheet-style extraction (if there are
                            ruling lines separating each cell, as in a PDF
                            of an Excel spreadsheet)
 -s,--password <PASSWORD>   Password to decrypt document. Default is empty
 -u,--use-line-returns      Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
 -v,--version               Print version and exit.

It also includes a debugging tool, run java -cp ./target/tabula-0.9.1-jar-with-dependencies.jar technology.tabula.debug.Debug -h for the available options.

You can also integrate tabula-java with any JVM language. For Java examples, see the tests folder.

JVM start-up time is a lot of the cost of the tabula command, so if you're trying to extract many tables from PDFs, you have a few options for speeding it up:

the drip utility
the Ruby, Python, R, and Node.js bindings
writing your own program in any JVM language (Java, JRuby, Scala) that imports tabula-java.
waiting for us to implement an API/server-style system (it's on the roadmap)

Building from Source

Clone this repo and run:

mvn clean compile assembly:single

Contributing

Interested in helping out? We'd love to have your help!

You can help by:

Reporting a bug.
Adding or editing documentation.
Contributing code via a Pull Request.
Spreading the word about tabula-java to people who might be able to benefit from using it.

Backers

You can also support our continued work on tabula-java with a one-time or monthly donation on OpenCollective. Organizations who use tabula-java can also sponsor the project for acknolwedgement on our official site and this README.

Special thanks to the following users and organizations for generously supporting Tabula with donations and grants: