chezou/tabula-py

Issues finding a table that has bottom borders missing

drennapete opened this issue · 4 comments

Summary of your issue

Same issue as described on this closed issue: #274

Basically, the bottom row gets dropped if it doesn't have a complete bottom border. In the example below, the row with "Sprinklers" in the left column is not detected.

Example 1.pdf

Check list before submit

Did you read [FAQ]: yes

(Optional, but really helpful) Your PDF URL: yes, see above

Paste the output of import tabula; tabula.environment_info() on Python REPL: yes

    3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit (Intel)]
Java version:
    java version "1.8.0_301"
Java(TM) SE Runtime Environment (build 1.8.0_301-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.301-b09, mixed mode)
tabula-py version: 1.3.1
platform: Windows-10-10.0.22000-SP0
uname:
    uname_result(system='Windows', node='LAPTOP-DVV0H1A8', release='10', version='10.0.22000', machine='AMD64', processor='Intel64 Family 6 Model 142 Stepping 10, GenuineIntel')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')

What did you do when you faced the problem?

Reviewed this thread; however, increasing the area did not solve the problem: https://stackoverflow.com/questions/60883312/tabula-py-skips-first-page-from-pdf-and-misses-some-tabular-data

Reviewed the bug report here: #274

Tried switching lattice between True and False; however, this did not solve the issue.

Code:

import tabula

df = tabula.read_pdf('Example 1.pdf', spreadsheet=True, guess=False, pages=1, stream=True)
print(df)

Expected behavior:

Expected a table with 16 rows ending with the row containing "Sprinklers" as the final row.

Actual behavior:

Using the GUI output of the same extraction, as it's easier to comprehend given the table size:

preview1

Related Issues:

#274

It's not perfect, but dropping spreadsheet=True seems to work with stream=True. spreadsheet is the former parameter name of lattice, so your code sets lattice=True and stream=True at the same time.

In [1]: import tabula

In [2]: pdf_path = "./Example.1.pdf"

In [6]: tabula.read_pdf(pdf_path, stream=True, pages=1)
Out[6]:
[                                 Strip Footing -  Unnamed: 0    245.00 $138.53  Unnamed: 1   $33,940   $8,292  ... Unnamed: 2   0%  73% 0%.1 0%.2  $11,403
 0                                          Thick         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 1                                 Concrete Block         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 2                          Light Load - Bay Size         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 3                                Column Footings         NaN   14,981.00 $0.40         NaN    $5,992   $1,464  ...        NaN   0%  73%   0%   0%   $2,013
 4                               > 625 <= 1225 sf         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 5        Lowest Concrete Floor Medium > 4" <= 6"         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 6                                            NaN         NaN   14,981.00 $6.08         NaN   $91,084  $22,253  ...        NaN   0%  73%   0%   0%  $30,601
 7                                (on fill) Thick         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 8                            Steel Frame - Steel         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 9         Framed Roof (Not Light Load - Bay Size         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 10                                           NaN         NaN  14,981.00 $13.05         NaN  $195,502  $47,763  ...        NaN   0%  73%   0%   0%  $65,682
 11               Including Roof > 625 <= 1225 sf         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 12                                     Finishes)         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 13                         >8'' Thick - Standard         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 14                           Base Wall - Masonry         NaN   4,029.00 $15.33         NaN   $61,765  $15,090  ...        NaN   0%  73%   0%   0%  $20,751
 15                                         Block         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 16                                Wood Sectional         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 17           Base Wall - Doors Overhead Manually         NaN     303.56 $26.25         NaN    $7,968   $1,947  ...        NaN   0%  73%   0%   0%   $2,677
 18                                      Operated         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 19  Additive Wall - Metal Light - 30 to 26 ga. -         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 20                                           NaN         NaN    1,864.69 $6.87         NaN   $12,810   $3,130  ...        NaN   0%  73%   0%   0%   $4,304
 21                            Siding Prefinished         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 22                     Flat Roof Built up- 4 ply         NaN   14,981.00 $7.98         NaN  $119,548  $29,207  ...        NaN   0%  73%   0%   0%  $40,164
 23                   Insulation - Rigid 2" (R10)         NaN   14,981.00 $2.36         NaN   $35,355   $8,638  ...        NaN   0%  73%   0%   0%  $11,878
 24                             >6" <=10" Painted         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 25                            Partitions - Block         NaN   2,124.00 $19.58         NaN   $41,588  $10,160  ...        NaN   0%  73%   0%   0%  $13,972
 26                                          only         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 27               Dock Levellers Mechanical ( # )         NaN    4.00 $4,965.03         NaN   $19,860   $4,852  ...        NaN   0%  73%   0%   0%   $6,672
 28     Lighting - Open Strip Average >=.50 <1.00         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 29                                           NaN         NaN   14,981.00 $2.82         NaN   $42,246  $10,321  ...        NaN   0%  73%   0%   0%  $14,193
 30                       Fluorescent watts/sq ft         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 31              Industrial Floor Drains Adequate         NaN   14,981.00 $0.22         NaN    $3,296     $805  ...        NaN   0%  73%   0%   0%   $1,107
 32                           Heating - Suspended         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 33                 Unit Gas Heaters - No Average         NaN   14,981.00 $4.14         NaN   $62,021  $15,152  ...        NaN   0%  73%   0%   0%  $20,837
 34                                         Ducts         NaN               NaN         NaN       NaN      NaN  ...        NaN  NaN  NaN  NaN  NaN      NaN
 35                 Sprinklers Open - Warehouse /         NaN   14,981.00 $2.24         NaN   $33,557   $8,198  ...        NaN   0%  73%   0%   0%  $11,274

 [36 rows x 13 columns]]

ok thanks @chezou! So two related questions:

  1. Is there no way to get the bottom row using the 'lattice' parameter?
  2. If using 'stream', is there any way of avoiding the merging of the cell content? For example, in the first column of your output above at row 35: "Sprinklers Open - Warehouse /" should be two columns, "Sprinklers" and "Open - Warehouse /".

I can work with cells being split across multiple rows, but having that with merged columns is impossible to process.

Is there no way to get the bottom row using the 'lattice' parameter?

In this case, no. The limitation of the lattice parameter is that it doesn't work well without ruling lines. Here is the relevant part of the tabula-java documentation:

 -l,--lattice               Force PDF to be extracted using lattice-mode
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)

So, if you want to use the lattice flag, the table should have ruling lines around its cells.

If using 'stream', is there any way of avoiding the merging of the cell content? For example, in the first column of your output above at row 35: "Sprinklers Open - Warehouse /" should be two columns, "Sprinklers" and "Open - Warehouse /".

I think it's a limitation of the stream option. The PDF format has no specification for representing tables, so tabula-java extracts them using heuristics. There is no cell object in a PDF, so it's hard to know what constitutes "a cell".
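
One thing that may reduce the merging in stream mode is to give the column boundaries explicitly instead of relying on the heuristics. The sketch below assumes your tabula-py version supports the `columns` option of `read_pdf` (X coordinates of column boundaries, in PDF points). The helper names `stream_kwargs` and `read_with_columns` are made up for illustration, and the coordinates in the usage comment are placeholders, not values measured from Example 1.pdf. `guess=False` is carried over from the original report.

```python
def stream_kwargs(column_edges, pages=1):
    """Build read_pdf keyword arguments for stream mode with fixed column edges.

    column_edges: X coordinates (PDF points) where one column ends and the
    next begins. These have to be measured from the PDF by hand.
    """
    edges = sorted(float(x) for x in column_edges)
    if not edges:
        raise ValueError("at least one column boundary is required")
    return {"pages": pages, "stream": True, "guess": False, "columns": edges}


def read_with_columns(pdf_path, column_edges, pages=1):
    """Run tabula-py in stream mode with explicit column boundaries."""
    import tabula  # imported lazily so stream_kwargs works without Java installed

    return tabula.read_pdf(pdf_path, **stream_kwargs(column_edges, pages))


# Hypothetical usage; the coordinates below are placeholders:
# tables = read_with_columns("Example 1.pdf", [95.0, 180.0, 260.0])
```

The boundaries can be read off in the Tabula GUI or any PDF viewer that shows coordinates. Whether this fully separates "Sprinklers" from "Open - Warehouse /" depends on how cleanly the text in the PDF falls on either side of the chosen edge.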

ok, thanks for your time @chezou. I think I'll have to run both, then try to reconcile the two outputs.
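
A rough sketch of what that reconciliation could look like, under some loud assumptions: every real table row carries at least one dollar amount, both modes extract the same amounts for the same row, and label-only continuation rows carry none. The function names are made up for illustration. Rows are plain lists of cell values, so a DataFrame returned by tabula-py would first be converted with `df.values.tolist()`.

```python
import re

# Dollar amounts such as "$2.24" or "$33,557". Quantities and label text are
# ignored on purpose, because stream mode may merge them into a neighbouring cell.
DOLLAR = re.compile(r"\$\d[\d,]*(?:\.\d+)?")


def dollar_fingerprint(row):
    """Return the set of dollar amounts found anywhere in a row."""
    amounts = set()
    for cell in row:
        if cell is None:
            continue
        amounts.update(DOLLAR.findall(str(cell)))
    return frozenset(amounts)


def rows_missing_from_lattice(lattice_rows, stream_rows):
    """Return stream rows whose dollar amounts match no lattice row."""
    seen = {dollar_fingerprint(row) for row in lattice_rows}
    missing = []
    for row in stream_rows:
        fingerprint = dollar_fingerprint(row)
        # Rows without any dollar amount are treated as label continuations.
        if fingerprint and fingerprint not in seen:
            missing.append(row)
    return missing
```

The rows this returns are candidates for the dropped bottom row. Their cells would still need to be split into the lattice column layout by hand or with a table-specific rule.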