PanDAWMS/dkb

Table descriptions above tables


PDF Analyzer's table processing algorithm detects the table description and separates table lines from all other lines. These procedures work on the assumption that the table description is positioned below the table:
[Image: proper_table]
However, some documents can position descriptions above tables or even mix both kinds of positioning. PDF Analyzer either fails to extract such tables or extracts them incorrectly.
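To make the idea concrete, here is a minimal sketch of how the caption side could be decided per table instead of being assumed. It is not the PDF Analyzer code: the `Line` class, the `tabular` flag, the `Table N:` caption pattern, and the `max_gap` threshold are all hypothetical, and coordinates are assumed to be measured from the top of the page.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical caption pattern: "Table 3:" or "Table 3." at the start of a line.
CAPTION_RE = re.compile(r"^\s*Table\s+\d+\s*[:.]", re.IGNORECASE)


@dataclass
class Line:
    text: str
    top: float             # assumed: distance from the top of the page
    bottom: float
    tabular: bool = False  # assumed: set by some earlier column-alignment check


def is_caption(line: Line) -> bool:
    return bool(CAPTION_RE.match(line.text))


def caption_side(lines: List[Line], idx: int, max_gap: float = 40.0) -> Optional[str]:
    """Say whether the caption at lines[idx] sits 'above' or 'below' its table.

    `lines` must be sorted top to bottom. Returns None if no tabular line
    is found within `max_gap` on either side of the caption.
    """
    cap = lines[idx]
    candidates = []

    # Nearest tabular line above the caption (stop at another caption).
    for prev in reversed(lines[:idx]):
        if is_caption(prev):
            break
        if prev.tabular:
            gap = cap.top - prev.bottom
            if gap <= max_gap:
                candidates.append((gap, "below"))  # table is above the caption
            break

    # Nearest tabular line below the caption (stop at another caption).
    for nxt in lines[idx + 1:]:
        if is_caption(nxt):
            break
        if nxt.tabular:
            gap = nxt.top - cap.bottom
            if gap <= max_gap:
                candidates.append((gap, "above"))  # table is below the caption
            break

    if not candidates:
        return None
    # The closer block of tabular lines wins.
    return min(candidates)[1]
```

With something like this, a page that mixes both kinds of positioning could be handled caption by caption: each caption is attached to whichever block of table lines is closer, rather than always to the one above it.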

Document examples: CDS_CERN-ATL-COM-PHYS-2016-135, page 13.

Note: it seems that the term "caption" is used more often than "description" or "header".

Some work was done on this (see 5e149e7). As usual, there is much to improve. In particular, I should point out that measuring the position of main text strings may cause problems with rotated pages. This should be looked into.