OCR-D/ocrd_anybaseocr

block segmentation: overlaps and quality of prebuilt models

bertsky opened this issue · 0 comments

Once I got the block segmentation to actually run, I was puzzled by the extremely poor results of the provided model.

Here's how I gradually worked to isolate the problem.

  • using default 0.9 confidence threshold:
(screenshots: FILE_0001_REGIONS-ANYOCR_bbox-best_pageviewer, FILE_0002_REGIONS-ANYOCR_bbox-best_pageviewer)
  • using lower 0.5 confidence threshold:
(screenshots: FILE_0001_REGIONS-ANYOCR_bbox-all_pageviewer, FILE_0002_REGIONS-ANYOCR_bbox-all_pageviewer)
  • using default 0.9 confidence threshold, but annotating a polygon from the mask:
(screenshots: FILE_0001_REGIONS-ANYOCR_mask-best_pageviewer, FILE_0002_REGIONS-ANYOCR_mask-best_pageviewer)
  • using lower 0.5 confidence threshold, but annotating a polygon from the mask:
(screenshots: FILE_0001_REGIONS-ANYOCR_mask-all_pageviewer, FILE_0002_REGIONS-ANYOCR_mask-all_pageviewer)
  • using lower 0.5 confidence threshold, but annotating a polygon from the mask, and doing non-maximum suppression and other post-processing (like checking for containment):
(screenshots: FILE_0001_REGIONS-ANYOCR_mask-all-nms_pageviewer, FILE_0002_REGIONS-ANYOCR_mask-all-nms_pageviewer)
  • using even lower 0.02 confidence threshold, but annotating a polygon from the mask, and suppressing the classes header, footer, footnote, footnote-continued, endnote, keynote (reserving their probability mass):
(screenshots: FILE_0001_REGIONS-ANYOCR_mask-all-active_pageviewer, FILE_0002_REGIONS-ANYOCR_mask-all-active_pageviewer)
  • using even lower 0.02 confidence threshold, but annotating a polygon from the mask, and suppressing the classes header, footer, footnote, footnote-continued, endnote, keynote (reserving their probability mass), and doing non-maximum suppression and other post-processing (like checking for containment):
(screenshots: FILE_0001_REGIONS-ANYOCR_mask-all-active-nms_pageviewer, FILE_0002_REGIONS-ANYOCR_mask-all-active-nms_pageviewer)
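To illustrate the "polygon from the mask" variants above, here is a minimal sketch of deriving a polygon outline from a binary instance mask instead of the rectangular bounding box. For simplicity it only traces row-wise extremes (valid for vertically convex shapes); a real implementation would do full contour tracing (e.g. via `cv2.findContours`). All names are illustrative, not the actual processor code.

```python
# Sketch: turn a binary instance mask into a polygon ring, instead of a bbox.
# Assumption: shapes are vertically convex, so per-row extremes suffice.
import numpy as np

def mask_to_polygon(mask):
    """mask: 2-D bool/0-1 array. Returns a list of (x, y) polygon points."""
    rows = np.where(mask.any(axis=1))[0]     # rows containing foreground
    left, right = [], []
    for y in rows:
        xs = np.where(mask[y])[0]
        left.append((int(xs[0]), int(y)))    # left edge, top to bottom
        right.append((int(xs[-1]), int(y)))  # right edge
    return left + right[::-1]                # close the ring: down left, up right
```

The resulting point list can then be annotated as the region's `Coords/@points` instead of the axis-aligned bounding box.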
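The confidence filtering, non-maximum suppression and containment check used above can be sketched as follows. This is a hypothetical, simplified re-implementation for illustration; function names and thresholds are my own, not the processor's actual API.

```python
# Sketch: filter detections by confidence, then greedy NMS by IoU, plus
# suppression of boxes (almost) fully contained in a higher-scoring box.
# Thresholds are illustrative assumptions.
import numpy as np

def postprocess(boxes, scores, conf_thresh=0.5, iou_thresh=0.5, contain_thresh=0.9):
    """boxes: (N, 4) array of [x0, y0, x1, y1]; scores: (N,) confidences."""
    keep_mask = scores >= conf_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)              # highest confidence first
    kept = []
    for i in order:
        x0, y0, x1, y1 = boxes[i]
        area_i = (x1 - x0) * (y1 - y0)
        suppressed = False
        for j in kept:
            u0, v0, u1, v1 = boxes[j]
            iw = max(0.0, min(x1, u1) - max(x0, u0))
            ih = max(0.0, min(y1, v1) - max(y0, v0))
            inter = iw * ih
            area_j = (u1 - u0) * (v1 - v0)
            iou = inter / (area_i + area_j - inter)
            containment = inter / area_i     # how much of i lies inside j
            if iou > iou_thresh or containment > contain_thresh:
                suppressed = True
                break
        if not suppressed:
            kept.append(i)
    return boxes[kept], scores[kept]
```

A lower-scoring region that is almost entirely inside an already-kept region is dropped even when its IoU with that region is small, which is exactly the containment case that plain NMS misses.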

So all these refinements seem crucial.
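The class suppression with reserved probability mass can be sketched like this: zero out the scores of the visually indistinguishable classes and renormalise each detection's distribution so the suppressed mass flows to the remaining classes. The class inventory and distribution shape here are assumptions for illustration only.

```python
# Sketch: suppress visually indistinguishable classes and redistribute
# their probability mass to the surviving classes. Class list is assumed.
import numpy as np

CLASSES = ["paragraph", "heading", "header", "footer", "footnote",
           "footnote-continued", "endnote", "keynote"]
SUPPRESS = {"header", "footer", "footnote", "footnote-continued", "endnote", "keynote"}

def redistribute(probs):
    """probs: (N, C) per-detection class distributions, each row summing to 1."""
    probs = probs.copy()
    keep = np.array([c not in SUPPRESS for c in CLASSES])
    probs[:, ~keep] = 0.0
    # renormalise so each row sums to 1 again (the suppressed mass is
    # "reserved" and reassigned proportionally to the remaining classes)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs
```

With the suppressed classes removed, detections that would otherwise have been swallowed by e.g. the footnote class can still clear the (now very low) confidence threshold as plain text regions.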

But it appears that this model was trained on highly overlapping regions – which makes it next to impossible to avoid such overlaps during prediction. An equally serious problem seems to be the nature of the classification applied: footnotes are simply not visually distinguishable from other text regions (only textually/logically), so they usurp all the energy (probability mass) of their look-alikes. IMHO, an adequate model would treat this subclassification as a secondary task.

Hence, inevitably, we need to retrain this.

@n00blet @mahmed1995 @khurramHashmi @mjenckel can you please provide details about the training procedure and dataset you used? There's virtually nothing about this in the OCR-D reader, and your final DFG presentation poster only references one paper on page frame detection and one on dewarping. Am I correct in assuming this repo is where your training tools reside?