Loss of information oftentimes in the last line of a table

Describe the bug
I've tried the plain pdftotree command line utility on a few PDF files with tables, and found that wherever there is a table structure, the last line is usually not captured in the output hOCR file.

Is this expected behavior, or does it have something to do with the extract_tables utility?

To Reproduce
Steps to reproduce the behavior:

1. Download the sample PDF from https://www.w3.org/WAI/WCAG20/Techniques/working-examples/PDF20/table.pdf
2. Run pdftotree pdf/table.pdf -o hocr/table.hocr
3. Check the hOCR output

Expected behavior
The last line of the table should be extracted in the output. Currently it is not.

Environment (please complete the following information):

OS: macOS 10.15.6
pdftotree Version: 0.5.0
pdfminer.six Version: 20200726

Additional context
The same behavior occurred on a few other files I tried.

This is not expected behaviour.
In addition to the missing last row of the table, I can see some duplicated cells.
However, this may not be a bug in pdftotree, as it relies on tabula for the table recognition.
I'd appreciate it if you could try tabula-py directly on the same PDF.

Yea, that's also what I thought.

Will do! Thanks

Sorry for the delay. It turns out tabula works fine on the PDFs I used. Although sometimes it may not be able to accurately convert a table structure into a dataframe or JSON, the pure text information is fully preserved. So I suspect there might be a minor problem in the pipeline of parsing the output of tabula-py.

I looked into this issue and confirmed that it is a bug in pdftotree, in how it specifies the table area.

$ pdftotree table.pdf -o table.hocr -vv
[INFO] pdftotree.core - Digitized PDF detected, building tree structure...
[WARNING] pdftotree.utils.pdf.pdf_parsers - No boxes to get figures from on page 1.
[INFO] pdftotree.core - Tree structure built, creating html...
[DEBUG] pdftotree.TreeExtract - Calling tabula at page: 1 and area: (146.20799999999997, 90.0, 331.78175999999996, 539.4936).
[DEBUG] pdftotree.TreeExtract - Tabula recognized 1 table(s).
[INFO] pdftotree.core - HTML created.
hOCR output to table.hocr

As can be seen in the log message, pdftotree specified a table area as (146.20799999999997, 90.0, 331.78175999999996, 539.4936) (top, left, bottom, right).
This is actually a few pixels smaller than the actual table.

I wonder where this pixel shift happens.

I think I figured out what was happening.
When you run pdftotree without the -mt option, it detects tables heuristically.

pdftotree/pdftotree/TreeExtract.py

Lines 256 to 259 in 0686a18

    
    # use heuristics to get tables if no model_type is provided
    else:
        for page_num in self.elems.keys():
            tables[page_num] = self.get_tables_page_num(page_num)

The heuristic used here is that words are vertically aligned in a table.

pdftotree/pdftotree/utils/pdf/pdf_parsers.py

Lines 54 to 66 in 0686a18

    
    tbls, tbl_features = cluster_vertically_aligned_boxes(
        boxes,
        elems.layout.bbox,
        avg_font_pts,
        width,
        char_width,
        boxes_segments,
        boxes_curves,
        boxes_figures,
        page_width,
        combine,
    )
    return tbls, tbl_features
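
As a rough illustration of the "vertically aligned words" idea only, here is a toy sketch. It is not what cluster_vertically_aligned_boxes does internally: `group_by_left_edge`, the 2.0 tolerance, and the word boxes are all invented for this example, and boxes use the (top, left, bottom, right) order from the log above.

```python
from typing import List, Tuple

# (top, left, bottom, right); made-up values, not real pdfminer output.
WordBox = Tuple[float, float, float, float]


def group_by_left_edge(words: List[WordBox], tol: float = 2.0) -> List[List[WordBox]]:
    """Group words whose left edges are within `tol` of the previous word's left edge."""
    groups: List[List[WordBox]] = []
    for box in sorted(words, key=lambda b: b[1]):
        if groups and box[1] - groups[-1][-1][1] <= tol:
            groups[-1].append(box)  # close enough: same column
        else:
            groups.append([box])  # gap too large: start a new column
    return groups


words = [
    (100.0, 50.0, 110.0, 90.0),
    (120.0, 50.5, 130.0, 85.0),
    (100.0, 120.0, 110.0, 160.0),
    (120.0, 121.0, 130.0, 150.0),
    (200.0, 300.0, 210.0, 340.0),  # a stray word, aligned with nothing
]

groups = group_by_left_edge(words)
# Columns containing at least two aligned words hint at a table.
aligned_columns = [g for g in groups if len(g) >= 2]
print(len(groups), len(aligned_columns))
```

Note that only word boxes take part in the grouping here, which is the property that matters for this issue.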

So the table area detected by this heuristic, (146.20799999999997, 90.0, 331.78175999999996, 539.4936), is correct with respect to how the table is detected: it covers all the words in the table. However, it does not include the table border lines.
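
To make the difference concrete, here is a minimal sketch of a text-only bounding box versus one that also covers the border lines. It is not pdftotree code: `Box`, `union_bbox`, and every coordinate are assumptions made up for illustration, with a top-left origin and the (top, left, bottom, right) order used in the log.

```python
from typing import Iterable, NamedTuple


class Box(NamedTuple):
    top: float
    left: float
    bottom: float
    right: float


def union_bbox(boxes: Iterable[Box]) -> Box:
    """Smallest box that covers every box in `boxes`."""
    boxes = list(boxes)
    return Box(
        top=min(b.top for b in boxes),
        left=min(b.left for b in boxes),
        bottom=max(b.bottom for b in boxes),
        right=max(b.right for b in boxes),
    )


# Hypothetical word boxes for a two-row, two-column table.
words = [
    Box(100.0, 50.0, 110.0, 90.0),
    Box(100.0, 120.0, 110.0, 160.0),
    Box(120.0, 50.0, 130.0, 90.0),
    Box(120.0, 120.0, 130.0, 160.0),
]
# Hypothetical ruling lines around the table, modelled as thin boxes.
borders = [
    Box(96.0, 46.0, 96.5, 164.0),    # top rule
    Box(133.5, 46.0, 134.0, 164.0),  # bottom rule
]

text_area = union_bbox(words)            # what a word-based heuristic sees
full_area = union_bbox(words + borders)  # what the drawn table occupies

print(text_area)
print(full_area)
```

In this made-up case the text-only area stops short of the bottom rule, so a tool that relies on ruling lines inside the given area would not see the closing edge of the last row.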

A short-term workaround would be to use the -mt option (probably with vision).
A long-term fix would be either to fix the heuristics or to offload the table detection to tabula.
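
One possible shape for a heuristic-side fix, purely as a sketch: grow the detected area by a small margin before handing it to tabula, clamped to the page. `pad_area`, the 5.0 margin, and the 612 x 792 page size are assumptions for illustration, not pdftotree code, and whether a fixed margin reliably captures the border lines on real PDFs is untested.

```python
from typing import Tuple

Area = Tuple[float, float, float, float]  # (top, left, bottom, right)


def pad_area(area: Area, margin: float, page_height: float, page_width: float) -> Area:
    """Expand `area` by `margin` on every side without leaving the page."""
    top, left, bottom, right = area
    return (
        max(0.0, top - margin),
        max(0.0, left - margin),
        min(page_height, bottom + margin),
        min(page_width, right + margin),
    )


# Area taken from the debug log above; margin and page size are assumed.
detected = (146.20799999999997, 90.0, 331.78175999999996, 539.4936)
padded = pad_area(detected, margin=5.0, page_height=792.0, page_width=612.0)
print(padded)
```

The result keeps the (top, left, bottom, right) order shown in the log, so it could be passed on wherever the original area was used.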
