HazyResearch/pdftotree

Loss of information oftentimes in the last line of a table

linM24 opened this issue · 7 comments

Describe the bug
I've tried the plain pdftotree command line utility on a few pdf files with tables, and found wherever there is a table structure, the last line is usually not captured in the output hOCR file.

Is this expected behavior, or does it have something to do with the extract_tables utility?

To Reproduce
Steps to reproduce the behavior:

  1. sample pdf downloaded from https://www.w3.org/WAI/WCAG20/Techniques/working-examples/PDF20/table.pdf
  2. run pdftotree pdf/table.pdf -o hocr/table.hocr
  3. check hOCR output

Expected behavior
The last line of the table should be extracted in the output, but it is missing.

Environment (please complete the following information):

  • OS: macOS 10.15.6
  • pdftotree Version: 0.5.0
  • pdfminer.six Version: 20200726

Additional context
Same behaviors occurred on a few other files I used.

This is not an expected behaviour.
In addition to the missing last row of the table, I can see some duplicates of cells.
However, this may not be pdftotree's bug as it relies on tabula for the table recognition.
I'd appreciate it if you could try tabula-py directly on the same PDF.

Yea, that's also what I thought.

Will do! Thanks

Sorry for the delay. It turns out tabula works fine on the PDFs I used. Although sometimes it may not be able to accurately convert a table structure into a dataframe or JSON, the pure text information is fully preserved. So I suspect there might be a minor problem in the pipeline of parsing the output of tabula-py.

I looked into this issue and confirmed that it is a pdftotree bug in the way it specifies the table area.

$ pdftotree table.pdf -o table.hocr -vv
[INFO] pdftotree.core - Digitized PDF detected, building tree structure...
[WARNING] pdftotree.utils.pdf.pdf_parsers - No boxes to get figures from on page 1.
[INFO] pdftotree.core - Tree structure built, creating html...
[DEBUG] pdftotree.TreeExtract - Calling tabula at page: 1 and area: (146.20799999999997, 90.0, 331.78175999999996, 539.4936).
[DEBUG] pdftotree.TreeExtract - Tabula recognized 1 table(s).
[INFO] pdftotree.core - HTML created.
hOCR output to table.hocr

As can be seen in the log message, pdftotree specified a table area as (146.20799999999997, 90.0, 331.78175999999996, 539.4936) (top, left, bottom, right).
This is actually a few pixels smaller than the actual table.

I wonder where this pixel shift happens.

I think I figured out what was happening.
When you run pdftotree without the -mt option, it detects tables heuristically.

# use heuristics to get tables if no model_type is provided
else:
    for page_num in self.elems.keys():
        tables[page_num] = self.get_tables_page_num(page_num)

The heuristic used here is that words are vertically aligned in a table.

tbls, tbl_features = cluster_vertically_aligned_boxes(
    boxes,
    elems.layout.bbox,
    avg_font_pts,
    width,
    char_width,
    boxes_segments,
    boxes_curves,
    boxes_figures,
    page_width,
    combine,
)
return tbls, tbl_features

So the table area detected by this heuristic, (146.20799999999997, 90.0, 331.78175999999996, 539.4936), is actually correct given how the table is detected. This area covers all the words in the table. However, it does not include the table border lines.
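To illustrate why an area built only from words can miss the border lines, here is a minimal sketch. It is not pdftotree's code: the helper names and all coordinates are hypothetical, and boxes use the same (top, left, bottom, right) order as the log message.

```python
from typing import Iterable, Tuple

# (top, left, bottom, right), origin at the top-left of the page
Area = Tuple[float, float, float, float]


def union_area(boxes: Iterable[Area]) -> Area:
    """Return the smallest area that covers every box."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one box")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def contains(outer: Area, inner: Area) -> bool:
    """True if `inner` lies completely inside `outer`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


# Hypothetical word boxes inside a table (made-up coordinates).
words = [
    (150.0, 95.0, 162.0, 200.0),
    (150.0, 300.0, 162.0, 530.0),
    (318.0, 95.0, 330.0, 200.0),
]
# Hypothetical outer border of the same table, drawn slightly outside the text.
border = (145.0, 88.0, 335.0, 542.0)

text_area = union_area(words)
print(text_area)                    # area covering the words only
print(contains(text_area, border))  # does the word area cover the border?
```

In this toy case the word-only area sits strictly inside the border rectangle, which matches the observation that the area passed to tabula is slightly smaller than the drawn table.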

A short-term workaround would be to use the -mt option (probably with vision).
A long-term fix would be either to fix the heuristics or to offload table detection to tabula.
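One possible direction for fixing the heuristics, shown only as a sketch, is to pad the detected area by a small margin so that it also covers the border lines. The function name, the 5-point margin, and the default page size (US Letter, 612 x 792 points) are assumptions for illustration; none of them come from pdftotree.

```python
from typing import Tuple

# (top, left, bottom, right), origin at the top-left of the page
Area = Tuple[float, float, float, float]


def pad_area(
    area: Area,
    margin: float,
    page_height: float = 792.0,  # assumed page size, not read from the PDF
    page_width: float = 612.0,
) -> Area:
    """Grow `area` by `margin` on every side, clamped to the page."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    top, left, bottom, right = area
    return (
        max(0.0, top - margin),
        max(0.0, left - margin),
        min(page_height, bottom + margin),
        min(page_width, right + margin),
    )


# Area taken from the debug log in this thread.
logged = (146.20799999999997, 90.0, 331.78175999999996, 539.4936)
padded = pad_area(logged, margin=5.0)
print(padded)
```

A padded area like this could then be passed as the area argument when calling tabula-py directly. Whether a fixed margin is enough for every table is untested here; snapping to the nearest ruling lines would likely be more robust.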