qurator-spk/ocrd_repair_inconsistencies

Move to ocrd-segment-repair?

Opened this issue · 8 comments

@bertsky wrote in #1:

I still think this would make a very good addition to ocrd-segment-repair...

Yes, I think so too. It was unclear what exactly ocrd-segment-repair would do to my files other than my hypothetically added re-ordering operation. If ocrd-segment-repair is going down the "let the user choose a single operation" road, I'm happy to add this as one of those single operations.

To explain: I needed this to fix problems with some hundred ground truth files. As I wanted to be careful with my ground truth files I wanted to exactly fix this problem, nothing more. Therefore I wrote a separate script and did not add the operation to ocrd-segment-repair.

Yes, there's definitely going to be fine-grained control over which checks and repair heuristics to use in ocrd-segment-repair. Let's delay this until we have baked ocrd-segment-evaluate (PRImA tools re-implementation) and found ourselves some useful module + data structures.

Agreed.

kba commented

Shall we include this in ocrd_all or wait until you've decided whether/how to integrate with ocrd_segment?

I'd say now is as good a time as ever for ocrd_all. (We want to give users the best possible processing options.)

cneud commented

Since this is very OCR-D specific stuff, I would actually prefer this moved to ocrd-segment-repair at some point.

Sure, but see above: nothing has changed from ocrd_segment's side so far. As soon as we have a good library structure there and self-explaining and orthogonal repair processors/parameters, I'll address having ocrd-repair-inconsistencies flow into it. Segment re-ordering is also connected to layout evaluation (projected in ocrd-segment-evaluate) and to validation auto-repair hooks (as currently planned for coordinates) or auto-repair instrumentation (also projected for coordinates), so we first have to shake everything else together.

Since I've closed #8 (Find a better name) in favor of merging this into some other tool, I suggest a very specific operation name for the future: reorder-segments-to-match-parent-text.
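To make the name reorder-segments-to-match-parent-text concrete, here is a minimal Python sketch of what such a single, narrowly scoped operation might look like. It is not the actual processor code and does not use the OCR-D API: the `Segment` class, the function signature, the space joiner and the left-to-right sort key are all assumptions made for illustration. The one point it is meant to show is the careful behaviour described above: re-order only when that makes the children's text match the parent's text, and otherwise leave everything alone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """Hypothetical stand-in for a child element (e.g. a word in a line)."""
    text: str
    x: float  # assumed horizontal position of the segment
    y: float  # assumed vertical position of the segment


def reorder_segments_to_match_parent_text(
    parent_text: str, children: List[Segment], joiner: str = " "
) -> Optional[List[Segment]]:
    """Return the children in an order whose joined text equals parent_text.

    Returns the input order if it is already consistent, a left-to-right
    re-ordering if that fixes the mismatch, and None if it does not, so
    the caller can leave the file untouched.
    """
    if joiner.join(c.text for c in children) == parent_text:
        return list(children)  # already consistent, nothing to repair
    # Assumed heuristic: sort by horizontal position, then vertical.
    candidate = sorted(children, key=lambda c: (c.x, c.y))
    if joiner.join(c.text for c in candidate) == parent_text:
        return candidate
    return None  # re-ordering does not help, so change nothing


words = [Segment("world", 50, 10), Segment("hello", 0, 10)]
fixed = reorder_segments_to_match_parent_text("hello world", words)
unfixable = reorder_segments_to_match_parent_text("goodbye world", words)
```

Here `fixed` holds the two words in reading order, while `unfixable` is `None` because no re-ordering of these words yields the parent text.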