Extract tables from PDF files. tabula-extractor is the table extraction engine that powers Tabula, now available as a library and command-line program.
Versions 0.9.6 and greater of Tabula can export shell scripts using tabula-extractor
for bulk extraction.
tabula-extractor only works with JRuby 1.7 or newer. Install JRuby and run:
jruby -S gem install tabula-extractor
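If the installation succeeded, the gem provides a tabula command-line executable; printing its version is a quick sanity check (if the command isn't found on your PATH, running it through jruby -S tabula should also work):

tabula --version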
Tabula helps you extract tables from PDFs
Usage:
tabula [options] <pdf_file>
where [options] are:
--pages, -p <s>: Comma separated list of ranges. Examples: --pages 1-3,5-7 or --pages 3. Default is --pages 1 (default: 1)
--area, -a <s>: Portion of the page to analyze (top,left,bottom,right). Example: --area 269.875,12.75,790.5,561. Default is entire page
--columns, -c <s>: X coordinates of column boundaries. Example --columns 10.1,20.2,30.3
--password, -s <s>: Password to decrypt document. Default is empty (default: )
--guess, -g: Guess the portion of the page to analyze per page.
--debug, -d: Print detected table areas instead of processing.
--format, -f <s>: Output format (CSV,TSV,HTML,JSON) (default: CSV)
--outfile, -o <s>: Write output to <file> instead of STDOUT (default: -)
--spreadsheet, -r: Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
--no-spreadsheet, -n: Force PDF not to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
--silent, -i: Suppress all stderr output.
--use-line-returns, -u: Use embedded line returns in cells.
--version, -v: Print version and exit
--help, -h: Show this message
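Before running a full extraction, the --debug flag can be used to preview the table areas Tabula detects instead of processing them; for example (the file name here is just a placeholder):

tabula --debug --pages 1 some_document.pdf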
These examples use documents contained in tabula-extractor's test folder. If you want to follow along, download the documents and give them a shot. There's a more extensive explanation here.
Extract all the tables from a document into a spreadsheet called output.csv:
tabula test/heuristic-test-set/spreadsheet/tabla_subsidios.pdf -o output.csv
Extract only the tables on page 1 into a spreadsheet called output.csv:
tabula --pages 1 test/heuristic-test-set/spreadsheet/strongschools.pdf -o output.csv
Extract only the tables on page 1 as a CSV spreadsheet onto STDOUT (that is, print it out in your terminal window):
tabula --pages 1 test/heuristic-test-set/spreadsheet/strongschools.pdf
Extract the data from the table contained within a certain area on page 1 into a spreadsheet called output.csv:
tabula test/data/vertical_rulings_bug.pdf --area 250,0,325,1700 --pages 1 -o output.csv
Extract all the tables from a document into a tab-separated spreadsheet called output.tsv:
tabula test/heuristic-test-set/spreadsheet/strongschools.pdf -o output.tsv --format TSV
Extract the table from page 1, using specified locations for column boundaries, into a spreadsheet called output.csv:
tabula test/data/campaign_donors.pdf -o output.csv --columns 47,147,256,310,375,431,504
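The options above can be combined. For instance, this invocation (the file name is just a placeholder) extracts pages 1 and 2 and writes the result as JSON instead of CSV:

tabula --pages 1-2 --format JSON some_document.pdf -o output.json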
tabula-extractor is also a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. We don't have docs yet, but the tests are a good source of information.
Here's a very basic example, using the "spreadsheet" extraction method:
require 'tabula'

pdf_file_path = "whatever.pdf"
outfilename = "whatever.csv"
out = open(outfilename, 'w')

# Open the PDF and prepare to extract objects from every page (:all).
extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, :all)
extractor.extract.each do |pdf_page|
  # Each detected table is a Spreadsheet object; write each one out as a CSV block.
  pdf_page.spreadsheets.each do |spreadsheet|
    out << spreadsheet.to_csv
    out << "\n\n"
  end
end
extractor.close!
out.close
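To run this script, save it to a file and execute it with JRuby, after replacing whatever.pdf with the path to your own document (the script name here is arbitrary):

jruby extract_tables.rb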
Here's another example using the "original" extraction method, which is useful for tables that don't have ruling lines separating the rows and cells. This example extracts data from only pages 1 and 2.
require 'tabula'

pdf_file_path = "whatever.pdf"
outfilename = "whatever.csv"
out = open(outfilename, 'w')

# Only extract objects from pages 1 and 2.
extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, 1..2)
extractor.extract.each_with_index do |pdf_page, page_index|
  # The area of the page that contains the table, in the same
  # (top, left, bottom, right) order as the --area command-line flag.
  page_areas = [[250, 0, 325, 1700]]
  page_areas.each do |page_area|
    out << pdf_page.get_area(page_area).make_table.to_csv
    out << "\n\n"
  end
end
extractor.close!
out.close
This similar example uses the "original" extraction method, but specifies the locations of the columns. This is a useful tactic when crappy PDF-creation software lets one column's text flow into the next column; unless you specify column locations manually, Tabula will combine the two columns. You can find the column locations using a screen ruler; I find it works well to measure the width of the entire PDF and scale the locations based on the width of the page as PDFBox renders it, as shown in the example below.
require 'tabula'

pdf_file_path = "whatever.pdf"
outfilename = "whatever.csv"
out = open(outfilename, 'w')

extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, 1..2)
extractor.extract.each_with_index do |pdf_page, page_index|
  page_areas = [[250, 0, 325, 1700]]

  # 1700 is the width of the page as you measured it with a screen ruler;
  # scale those measurements to the page width that PDFBox reports.
  scale_factor = pdf_page.width / 1700

  # Column boundary locations, in the same measured units, converted into
  # vertical rulings that make_table will respect.
  vertical_ruling_locations = [0, 360, 506, 617, 906, 1034, 1160, 1290, 1418, 1548]
  vertical_rulings = vertical_ruling_locations.map { |n| Tabula::Ruling.new(0, n * scale_factor, 0, 1000) }

  page_areas.each do |page_area|
    out << pdf_page.get_area(page_area).make_table(:vertical_rulings => vertical_rulings).to_csv
    out << "\n\n"
  end
end
extractor.close!
out.close
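To make the scaling concrete: suppose PDFBox reports the page as 612 points wide and you measured the full page width as 1700 units with your screen ruler (both numbers are purely illustrative). Then scale_factor is 612 / 1700 ≈ 0.36, so a column boundary you measured at 360 units lands at roughly 360 × 0.36 ≈ 130 points in the coordinate system the extractor actually uses.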
PDFs are a terrible format for transmitting tabular data. Tabula uses two algorithms to try to reconstruct the underlying structure of the data table. This section describes how PDFs represent your data and how Tabula extracts it, so you can use tabula-extractor productively.
PDFs were designed to represent a paper document's layout across various computers and on paper, so they focus on precise positioning. They include primitives for text strings, geometric shapes, images and videos (and more), but no data tables. Tabula includes a Java library called PDFBox to access those embedded text strings and geometric shapes and uses them to reconstruct your table.
Why Can't Tabula Process Scanned Pages? Scanned PDF pages usually contain only one primitive: the image of the scanned page. Since those PDFs don't contain text strings or geometric shapes, Tabula won't be able to reconstruct your data -- unless you run the PDF through an OCR (Optical Character Recognition) program, which re-inserts those text strings into their original position, though the results can be error prone.
Tabula has two distinct algorithms to use for different kinds of tables. It uses a heuristic to try to guess which algorithm to use for each table, but this heuristic is wrong fairly often, so you may need to specify the algorithm yourself, using the Extraction Method selector buttons in the GUI or the --spreadsheet or --no-spreadsheet flags on the command line.
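For example, to force one algorithm or the other on the command line (the file name here is just a placeholder):

tabula --spreadsheet some_document.pdf -o output.csv
tabula --no-spreadsheet some_document.pdf -o output.csv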
- The spreadsheet algorithm uses geometric lines to reconstruct the table structure. After discarding oblique lines, the algorithm finds all of the lines' crossing points. Using those crossing points, it creates a large list of minimal rectangular areas (that is, rectangles that contain no other rectangles); these are the spreadsheet cells. The minimum bounding box of a group of adjacent cells is a table (called a Spreadsheet object). After Spreadsheet objects are created, empty "placeholder" cells are created wherever a cell in one row (or, likewise, column) spans a space that contains multiple cells in another row. Once we have the dimensions of all the cells on the page, the PDFBox library can get the text contained within each cell.
- The original or no-spreadsheet algorithm uses only the positions of text elements on the page. (Because OCR software doesn't reconstruct lines, this is the only algorithm available for OCRed PDFs.) The algorithm collects all the text on the page (or within the area of the page that contains a table, specified with the Tabula GUI or the --area flag) and finds "rivers" -- vertical spaces that don't contain any text for the entire height of the table. These are considered column boundaries. (If text from one column flows into another column because the PDF was created with crappy software, you can specify the column boundaries manually with the --columns flag.) Each line of text on the page (identified by its unique y location) is considered a separate row of the table. (If cells contain multiple rows, you may have to write a script to "roll them up" -- Tabula can't provide this functionality.)
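Both extraction methods are also reachable from the gem, as the earlier examples show. Here's a minimal sketch contrasting the two calls side by side, using only the API that appears above (the file name and the fixed page area are placeholders):

require 'tabula'

extractor = Tabula::Extraction::ObjectExtractor.new("whatever.pdf", :all)
extractor.extract.each do |pdf_page|
  # Spreadsheet method: tables are rebuilt from the page's ruling lines.
  pdf_page.spreadsheets.each { |spreadsheet| puts spreadsheet.to_csv }

  # Original method: a table is rebuilt from text positions inside an area.
  puts pdf_page.get_area([250, 0, 325, 1700]).make_table.to_csv
end
extractor.close!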
These two algorithms are inspired by some academic work, including Anssi Nurminen's "Algorithmic Extraction of Data in Tables in PDF Documents" (2013) for the spreadsheet algorithm.
You're welcome to try to integrate the tabula-extractor gem into your project. We don't really have documentation yet, though the tests may be a good source. If you're going to, please feel free to drop us a note and we may be able to give you some pointers.