tabula-py
is a simple Python wrapper of tabula-java, which can read table of PDF.
You can read tables from PDF and convert into pandas's DataFrame.
- Java
- Confirmed working with Java 7, 8
- pandas
pip install tabula-py
tabula-py enables you to extract table from PDF into DataFrame and JSON. It also can extract tables from PDF and save file as CSV, TSV or JSON.
import tabula
# Read pdf into DataFrame
df = tabula.read_pdf("test.pdf", options)
# Read remote pdf into DataFrame
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")
# convert PDF into CSV
tabula.convert_into("test.pdf", "output.csv", output_format="csv")
See example notebook
- pages (str, int,
list
ofint
, optional)- An optional values specifying pages to extract from. It allows
str
,int
,list
ofint
. - Example: 1, '1-2,3', 'all' or [1,2]. Default is 1
- An optional values specifying pages to extract from. It allows
- guess (bool, optional):
- Guess the portion of the page to analyze per page.
- area (
list
offloat
, optional):- Portion of the page to analyze(top,left,bottom,right).
- Example: [269.875, 12.75, 790.5, 561]. Default is entire page
- spreadsheet (bool, optional):
- Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
- nospreadsheet (bool, optional):
- Force PDF not to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
- password (bool, optional):
- Password to decrypt document. Default is empty
- silent (bool, optional):
- Suppress all stderr output.
- columns (list, optional):
- X coordinates of column boundaries.
- Example: [10.1, 20.2, 30.3]
- format (str, optional):
- Format for output file or extracted object. (CSV, TSV, JSON)
- output_path (str, optional):
- Output file path. File format of it is depends on
format
. - Same as
--outfile
option of tabula-java.
- Output file path. File format of it is depends on
Yes. You can use options
argument as following. The format is same as cli of tabula-java.
read_pdf_table(file_path, options="--columns 10.1,20.2,30.3")
In short, you can extract with area
and spreadsheet
option.
In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
Unnamed: 0 Col2 Col3 Col4 Col5
0 A B 12 R G
1 NaN R T 23 H
2 B B 33 R A
3 C T 99 E M
4 D I 12 34 M
5 E I I W 90
6 NaN 1 2 W h
7 NaN 4 3 E H
8 F E E4 R 4
How to use area
option
According to tabula-java wiki, there is a explain how to specify the area: https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want
For example, using macOS's preview, I got area information of this PDF:
java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
given
Note the left, top, height, and width parameters and calculate the following:
y1 = top
x1 = left
y2 = top + height
x2 = left + width
I confirmed with tabula-java:
java -jar ./tabula/tabula-0.9.1-jar-with-dependencies.jar -g -r -a "337.29,226.49,472.85,384.91" table.pdf
Without -r
(same as --spreadsheet
) option, it does not work properly.