Issues finding a table that has bottom borders missing
drennapete opened this issue · 4 comments
Summary of your issue
Same issue as described on this closed issue: #274
Basically, the bottom row gets dropped if it doesn't have a complete bottom border. Example below. You'll see the row with "Sprinklers" in the left column is not detected.
Check list before submit
Did you read [FAQ]: yes
(Optional, but really helpful) Your PDF URL: yes, see above
Paste the output of import tabula; tabula.environment_info() on Python REPL: yes
3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit (Intel)]
Java version:
java version "1.8.0_301"
Java(TM) SE Runtime Environment (build 1.8.0_301-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.301-b09, mixed mode)
tabula-py version: 1.3.1
platform: Windows-10-10.0.22000-SP0
uname:
uname_result(system='Windows', node='LAPTOP-DVV0H1A8', release='10', version='10.0.22000', machine='AMD64', processor='Intel64 Family 6 Model 142 Stepping 10, GenuineIntel')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')
What did you do when you faced the problem?
Reviewed this thread; however, increasing the area did not solve the problem: https://stackoverflow.com/questions/60883312/tabula-py-skips-first-page-from-pdf-and-misses-some-tabular-data
Reviewed the bug report here: #274
Tried switching lattice between True and False; however, this did not solve the issue.
Code:
import tabula
df = tabula.read_pdf('Example 1.pdf', spreadsheet=True, guess=False, pages=1, stream=True)
print(df)
Expected behavior:
Expected a table with 16 rows ending with the row containing "Sprinklers" as the final row.
Actual behavior:
Using the GUI output of the same table, as it's easier to comprehend given the table size:
Related Issues:
It's not perfect, but dropping spreadsheet=True seems to work with stream=True. spreadsheet is the former parameter name of lattice, so you were setting lattice=True and stream=True at once.
In [1]: import tabula
In [2]: pdf_path = "./Example.1.pdf"
In [6]: tabula.read_pdf(pdf_path, stream=True, pages=1)
Out[6]:
[ Strip Footing - Unnamed: 0 245.00 $138.53 Unnamed: 1 $33,940 $8,292 ... Unnamed: 2 0% 73% 0%.1 0%.2 $11,403
0 Thick NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
1 Concrete Block NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
2 Light Load - Bay Size NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
3 Column Footings NaN 14,981.00 $0.40 NaN $5,992 $1,464 ... NaN 0% 73% 0% 0% $2,013
4 > 625 <= 1225 sf NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
5 Lowest Concrete Floor Medium > 4" <= 6" NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
6 NaN NaN 14,981.00 $6.08 NaN $91,084 $22,253 ... NaN 0% 73% 0% 0% $30,601
7 (on fill) Thick NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
8 Steel Frame - Steel NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
9 Framed Roof (Not Light Load - Bay Size NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
10 NaN NaN 14,981.00 $13.05 NaN $195,502 $47,763 ... NaN 0% 73% 0% 0% $65,682
11 Including Roof > 625 <= 1225 sf NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
12 Finishes) NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
13 >8'' Thick - Standard NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
14 Base Wall - Masonry NaN 4,029.00 $15.33 NaN $61,765 $15,090 ... NaN 0% 73% 0% 0% $20,751
15 Block NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
16 Wood Sectional NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
17 Base Wall - Doors Overhead Manually NaN 303.56 $26.25 NaN $7,968 $1,947 ... NaN 0% 73% 0% 0% $2,677
18 Operated NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
19 Additive Wall - Metal Light - 30 to 26 ga. - NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
20 NaN NaN 1,864.69 $6.87 NaN $12,810 $3,130 ... NaN 0% 73% 0% 0% $4,304
21 Siding Prefinished NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
22 Flat Roof Built up- 4 ply NaN 14,981.00 $7.98 NaN $119,548 $29,207 ... NaN 0% 73% 0% 0% $40,164
23 Insulation - Rigid 2" (R10) NaN 14,981.00 $2.36 NaN $35,355 $8,638 ... NaN 0% 73% 0% 0% $11,878
24 >6" <=10" Painted NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
25 Partitions - Block NaN 2,124.00 $19.58 NaN $41,588 $10,160 ... NaN 0% 73% 0% 0% $13,972
26 only NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
27 Dock Levellers Mechanical ( # ) NaN 4.00 $4,965.03 NaN $19,860 $4,852 ... NaN 0% 73% 0% 0% $6,672
28 Lighting - Open Strip Average >=.50 <1.00 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
29 NaN NaN 14,981.00 $2.82 NaN $42,246 $10,321 ... NaN 0% 73% 0% 0% $14,193
30 Fluorescent watts/sq ft NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
31 Industrial Floor Drains Adequate NaN 14,981.00 $0.22 NaN $3,296 $805 ... NaN 0% 73% 0% 0% $1,107
32 Heating - Suspended NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
33 Unit Gas Heaters - No Average NaN 14,981.00 $4.14 NaN $62,021 $15,152 ... NaN 0% 73% 0% 0% $20,837
34 Ducts NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
35 Sprinklers Open - Warehouse / NaN 14,981.00 $2.24 NaN $33,557 $8,198 ... NaN 0% 73% 0% 0% $11,274
[36 rows x 13 columns]]
ok thanks @chezou! So two related questions:
- Is there no way to get the bottom row using the 'lattice' parameter?
- If using 'stream', is there any way of avoiding the merging of the cell content? For example, in the first column of your output above at row 35: "Sprinklers Open - Warehouse /" should be two columns, "Sprinklers" and "Open - Warehouse /".
I can work with cells being split across multiple rows, but having that with merged columns is impossible to process.
Is there no way to get the bottom row using the 'lattice' parameter?
In this case, no. The limitation of the lattice parameter is that it doesn't work well without ruling lines. Here is the relevant part of the tabula-java documentation:
-l,--lattice Force PDF to be extracted using lattice-mode
extraction (if there are ruling lines
separating each cell, as in a PDF of an Excel
spreadsheet)
So, if you want to use the lattice flag, the table should have ruling lines around its cells.
If using 'stream', is there any way of avoiding the merging of the cell content? For example, in the first column of your output above at row 35: "Sprinklers Open - Warehouse /" should be two columns, "Sprinklers" and "Open - Warehouse /".
I think it's a limitation of the stream option. PDF has no specification for representing tables, so tabula-java extracts them using heuristics. There is no cell object in a PDF, so it's hard to know what counts as "a cell".
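One avenue that was not raised in this thread, and that may or may not help with this particular PDF, is to give the stream heuristics explicit column boundaries through the columns option of tabula.read_pdf. The sketch below only shows how such a call could be assembled. The helper names build_stream_options and read_with_columns are hypothetical, the x coordinates are placeholders that would have to be measured from the real PDF, and the need for guess=False is an assumption based on how tabula-java appears to apply column positions.

```python
def build_stream_options(column_edges, page=1):
    """Build keyword arguments for a stream-mode read with fixed column edges.

    column_edges: x coordinates (in PDF points) where columns should be split.
    """
    return {
        "pages": page,
        "stream": True,   # stream mode only; not combined with lattice
        "guess": False,   # assumption: guessing must be off for columns to apply
        "columns": sorted(float(x) for x in column_edges),
    }


def read_with_columns(pdf_path, column_edges, page=1):
    """Run tabula-py with explicit column boundaries (needs Java and tabula-py)."""
    import tabula  # imported lazily so the helper above works without tabula

    return tabula.read_pdf(pdf_path, **build_stream_options(column_edges, page))


# Placeholder edges; the real values must be measured from the PDF itself.
options = build_stream_options([250.0, 110.0, 320.5])
print(options["columns"])
```

Calling read_with_columns("./Example.1.pdf", [...]) with measured edges would then show whether the first two columns come out separately; the heuristics may still merge or split cells in other ways.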
ok, thanks for your time @chezou. I think I'll have to run both, then try to reconcile the two outputs.
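A minimal sketch of what that reconciliation could look like, assuming both outputs have already been converted to plain lists of rows (for example with df.values.tolist()). The thread does not say how the outputs were actually reconciled, so the rule here is an assumption: a stream row counts as covered when some lattice row contains all of its non-label values, and anything not covered is reported as missing from the lattice output. The function names and the lattice sample rows are hypothetical; the stream values are taken from the output above and truncated to a few columns.

```python
def non_empty(row):
    """Return the row's non-empty cells as stripped strings (NaN/None dropped)."""
    cells = []
    for cell in row:
        text = "" if cell is None else str(cell).strip()
        if text and text.lower() != "nan":
            cells.append(text)
    return cells


def rows_missing_from_lattice(lattice_rows, stream_rows):
    """Return stream rows whose value cells appear in no single lattice row."""
    lattice_sets = [set(non_empty(row)) for row in lattice_rows]
    missing = []
    for row in stream_rows:
        values = non_empty(row[1:])  # skip the (possibly merged) label cell
        if not values:
            continue  # label-only continuation line, nothing to compare
        if not any(set(values) <= cells for cells in lattice_sets):
            missing.append(row)
    return missing


# Hypothetical lattice result: columns split correctly, bottom row dropped.
lattice_rows = [
    ["Flat Roof", "Built up- 4 ply", "14,981.00", "$7.98", "$119,548"],
    ["Industrial Floor Drains", "Adequate", "14,981.00", "$0.22", "$3,296"],
]
# Stream result: first two columns merged, bottom row present.
stream_rows = [
    ["Flat Roof Built up- 4 ply", None, "14,981.00", "$7.98", "$119,548"],
    ["Heating - Suspended", None, None, None, None],
    ["Industrial Floor Drains Adequate", None, "14,981.00", "$0.22", "$3,296"],
    ["Sprinklers Open - Warehouse /", None, "14,981.00", "$2.24", "$33,557"],
]

missing = rows_missing_from_lattice(lattice_rows, stream_rows)
print(missing)
```

Rows reported this way still carry the merged label from stream mode, so the label would need to be split by hand or by a rule specific to the document.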