slub/textract2page

increase coverage

bertsky opened this issue · 0 comments

  • Confidence (unfortunately, Textract reports a single value per block, which conflates Coords/@conf and TextEquiv/@conf)
  • TextType (HANDWRITING → @production=handwritten-printscript|handwritten-cursive, PRINTED → @production=printed)
  • support tables:
    • top-level TableRegion for TABLE block
    • recursive TextRegion for CELL block (i.e. ColumnIndex → Roles/TableCellRole/@columnIndex, RowIndex → Roles/TableCellRole/@rowIndex)
    • recursive TextRegion for MERGED_CELL block (i.e. ColumnSpan → Roles/TableCellRole/@colSpan, RowSpan → Roles/TableCellRole/@rowSpan) – diverging recursion between Textract and PAGE?
    • recursive TextRegion for TABLE_TITLE and TABLE_FOOTER block (i.e. Roles/TableCellRole/@header... or via ReadingOrder)
    • EntityTypes=STRUCTURED_TABLE|SEMI_STRUCTURED_TABLE (unclear how to represent in PAGE), EntityTypes=TABLE_TITLE|TABLE_SECTION_TITLE|TABLE_FOOTER|TABLE_SUMMARY|COLUMN_HEADER (unclear how this looks and compares with the actual recursive BlockType)?
    • also via ordered groups in ReadingOrder?
    • unclear: LineItemGroup and LineItems
  • PageClassification/PageType (unclear, but probably Page/@type)
  • support forms
    • BlockType=KEY_VALUE_SET and EntityTypes=KEY|VALUE → unclear how to represent: TableRegion or recursive TextRegion? Labels/Label?
  • support checkboxes within tables or forms
    • BlockType=SELECTION_ELEMENT and SelectionStatus=SELECTED|NOT_SELECTED → unclear how to represent
  • ignore query blocks (BlockType=QUERY|QUERY_RESULT)
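
The more clear-cut mappings above (TextType, Confidence, and the CELL/MERGED_CELL indices and spans) could be sketched roughly as follows. This is only an illustration, not the converter's actual code: the helper names `production_for`, `conf_for` and `cell_region` are hypothetical, and the XML is built with the standard library and without the PAGE namespace. It assumes that Textract's RowIndex/ColumnIndex start at 1 while PAGE's rowIndex/columnIndex start at 0, and that Confidence is on a 0..100 scale whereas @conf is 0..1. Since Textract does not distinguish cursive from printscript, the handwritten variant is left as a parameter.

```python
import xml.etree.ElementTree as ET


def production_for(block, handwritten="handwritten-cursive"):
    """Map a Textract TextType to a PAGE @production value (or None)."""
    text_type = block.get("TextType")
    if text_type == "HANDWRITING":
        # Assumption: the caller decides between cursive and printscript.
        return handwritten
    if text_type == "PRINTED":
        return "printed"
    return None


def conf_for(block):
    """Scale Textract Confidence (assumed 0..100) to PAGE @conf (0..1)."""
    confidence = block.get("Confidence")
    return None if confidence is None else round(confidence / 100.0, 4)


def cell_region(block, region_id):
    """Build a TextRegion with Roles/TableCellRole for a CELL or MERGED_CELL block."""
    region = ET.Element("TextRegion", id=region_id)
    roles = ET.SubElement(region, "Roles")
    role = ET.SubElement(roles, "TableCellRole")
    # Assumption: Textract indices are 1-based, PAGE indices are 0-based.
    role.set("rowIndex", str(block["RowIndex"] - 1))
    role.set("columnIndex", str(block["ColumnIndex"] - 1))
    role.set("rowSpan", str(block.get("RowSpan", 1)))
    role.set("colSpan", str(block.get("ColumnSpan", 1)))
    # Treat COLUMN_HEADER in EntityTypes as a header cell.
    is_header = "COLUMN_HEADER" in (block.get("EntityTypes") or [])
    role.set("header", str(is_header).lower())
    return region


# Example with a hand-written block dict (not real Textract output).
cell = {
    "BlockType": "CELL",
    "RowIndex": 1,
    "ColumnIndex": 2,
    "RowSpan": 1,
    "ColumnSpan": 1,
    "EntityTypes": ["COLUMN_HEADER"],
}
print(ET.tostring(cell_region(cell, "r1"), encoding="unicode"))
```

This leaves out everything marked unclear above (KEY_VALUE_SET, SELECTION_ELEMENT, the table-level EntityTypes), and it does not yet resolve the nesting of MERGED_CELL over CELL blocks.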