block segmentation: overlaps and quality of prebuilt models
bertsky opened this issue
Once I got the block segmentation to actually run, I was puzzled over the extremely bad results of the provided model.
Here's how I gradually worked to isolate the problem.
- using default 0.9 confidence threshold:
  *(result images omitted)*
- using lower 0.5 confidence threshold:
  *(result images omitted)*
- using default 0.9 confidence threshold, but annotating a polygon from the mask:
  *(result images omitted)*
- using lower 0.5 confidence threshold, but annotating a polygon from the mask:
  *(result images omitted)*
- using lower 0.5 confidence threshold, but annotating a polygon from the mask, and doing non-maximum suppression and other post-processing (like checking for containment):
  *(result images omitted)*
- using even lower 0.02 confidence threshold, but annotating a polygon from the mask, and suppressing the classes `header`, `footer`, `footnote`, `footnote-continued`, `endnote`, `keynote` (reserving their probability mass):
  *(result images omitted)*
- using even lower 0.02 confidence threshold, but annotating a polygon from the mask, suppressing the classes `header`, `footer`, `footnote`, `footnote-continued`, `endnote`, `keynote` (reserving their probability mass), and doing non-maximum suppression and other post-processing (like checking for containment):
  *(result images omitted)*
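The "polygon from the mask" step above could be sketched roughly as follows. This is only an illustrative stand-in, not the pipeline's actual code: it takes the convex hull of the mask's foreground pixels (a real implementation would trace the true outer contour, e.g. with `cv2.findContours`), but it already shows why a polygon annotation is tighter than the axis-aligned bbox.

```python
import numpy as np

def cross(o, a, b):
    """2D cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def mask_to_polygon(mask):
    """Crude polygon for a binary instance mask: the convex hull of its
    foreground pixels (Andrew's monotone chain). Dependency-free stand-in
    for proper contour tracing."""
    ys, xs = np.nonzero(mask)
    pts = sorted(set(zip(xs.tolist(), ys.tolist())))
    if len(pts) < 3:
        return pts
    def half_hull(points):
        hull = []
        for p in points:
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull
    lower, upper = half_hull(pts), half_hull(pts[::-1])
    return lower[:-1] + upper[:-1]  # closed ring, no repeated endpoint
```

The resulting point list would then be serialised into the region's `Coords` element.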
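Likewise, "suppressing the classes … (reserving their probability mass)" could look like the following sketch. The class names and label order are illustrative only (not the model's actual label map), and "reserving" is implemented here as simple renormalisation over the remaining classes:

```python
import numpy as np

# hypothetical label map -- NOT the model's actual class list
CLASSES = ['paragraph', 'heading', 'header', 'footer', 'footnote',
           'footnote-continued', 'endnote', 'keynote', 'marginalia']
SUPPRESS = {'header', 'footer', 'footnote', 'footnote-continued',
            'endnote', 'keynote'}

def reassign_scores(probs):
    """Zero the scores of visually indistinguishable classes and hand their
    probability mass to the remaining (look-alike) classes by renormalising.
    probs: per-detection class probabilities, shape (n, len(CLASSES))."""
    probs = np.asarray(probs, dtype=float).copy()
    idx = [i for i, c in enumerate(CLASSES) if c in SUPPRESS]
    probs[:, idx] = 0.0
    return probs / probs.sum(axis=1, keepdims=True)
```

With this, a region that the model scored mostly as `footnote` keeps its detection, but ends up labelled as the best-scoring visually similar class instead.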
So all of these refinements seem crucial.
But it also appears that this model was trained on highly overlapping regions, which makes it next to impossible to avoid such overlaps during prediction. An equally serious problem is the nature of the applied classification: footnotes are simply not visually distinguishable from other text regions (only textually/logically), so they usurp all the probability mass of their look-alikes. IMHO, an adequate model would treat this subclassification as a secondary task.
Hence, inevitably, we need to retrain this model.
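For reference, the non-maximum suppression and containment check mentioned above can be sketched generically like this (greedy box-level NMS; the thresholds are illustrative, and a production version would work on the polygons/masks rather than bounding boxes):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def contained_in(a, b, tolerance=0.95):
    """True if box b lies (almost) entirely inside box a."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    return inter >= tolerance * (b[2] - b[0]) * (b[3] - b[1])

def suppress(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep detections in descending score order, dropping any
    that overlap a kept box too much or are contained in one."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold
               and not contained_in(boxes[j], boxes[i])
               for j in keep):
            keep.append(int(i))
    return keep
```

For instance, a low-confidence region nested inside a higher-scoring one gets dropped even when their IoU is small.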
@n00blet @mahmed1995 @khurramHashmi @mjenckel can you please provide details about the training procedure and dataset you used? There's virtually nothing about this in the OCR-D reader, and your final DFG presentation poster only references one paper on page frame detection and one on dewarping. Am I correct in assuming this repo is where your training tools reside?