Add as transform script to ocr-fileformat?

Question


kba opened this issue 5 years ago · 4 comments

Wouldn't this be more versatile if it were integrated into ocr-fileformat / ocrd_fileformat ?

Answer 1 · 2020-04-16T16:52:54.000Z

That's difficult to do with the various parameters offered here, though:

hierarchy level for text
hierarchy level for outlines
font

These arise naturally when downgrading from an annotation to a presentation format, so the other converters don't share this.

If we flatten these into ocrd_fileformat's script-args, we lose all type checking and descriptions. (And if we keep them as top-level params, they will look ugly/ambiguous on the other conversions.)
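To illustrate the trade-off, here is a minimal sketch. The parameter names (`text_level`, `outline_level`, `font`), the schema layout and the flag syntax are all hypothetical, only loosely modelled on how an OCR-D tool description declares parameters; they are not the actual ocrd_fileformat or pagetopdf interface.

```python
import shlex

# Hypothetical typed parameter schema (names and layout are assumptions).
PARAM_SCHEMA = {
    "text_level": {
        "type": "string",
        "enum": ["region", "line", "word", "glyph"],
        "description": "hierarchy level to take the text from",
    },
    "outline_level": {
        "type": "string",
        "enum": ["", "region", "line", "word", "glyph"],
        "description": "hierarchy level to draw outlines for (empty: none)",
    },
    "font": {
        "type": "string",
        "description": "font file used for the text layer",
    },
}


def validate(params, schema=PARAM_SCHEMA):
    """Return a list of error messages for unknown names or non-enum values."""
    errors = []
    for name, value in params.items():
        spec = schema.get(name)
        if spec is None:
            errors.append(f"unknown parameter: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: {value!r} not in {spec['enum']}")
    return errors


def flatten(params):
    """Collapse the parameters into one opaque script-args style string."""
    return " ".join(
        f"--{name} {shlex.quote(str(value))}"
        for name, value in sorted(params.items())
    )
```

With the typed schema, `validate({"text_level": "paragraph"})` reports the bad value up front. Once the same parameters go through `flatten`, the wrapper only sees a string, so the typo surfaces (if at all) inside the script.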

Moreover, and even more difficult: as soon as this supports multi-page output (which would be fantastic), the whole ~~interface~~ behaviour will look very different from ocrd_fileformat's.

Related: OCR-D/spec#142

Answer 2 · 2020-04-16T17:11:04.000Z

It's similar to the {PAGE,ALTO,hOCR}-to-TEI transformations in that one format is page-based and the other document-based. There is no one way to do this, and the individual transformations will require a level of adaptation that makes an abstraction like ocr-fileformat inconvenient at some point. That's why ocr-fileformat has such a simple interface: for advanced use cases, it is much easier to just take the XSLT/script and adapt it.

If we had the pagetopdf as part of ocr-fileformat, we could at least support the use case "1 PAGE-XML -> 1 PDF" and then merge the results. Not terribly efficient but a good baseline.

How about integration into ocr-fileformat for the simple use case now and later investing time into getting ocrd_pagetopdf right with support for all the features the PRIMA tool offers?

Answer 3 · 2020-04-16T17:36:26.000Z

It's similar to the {PAGE,ALTO,hOCR}-to-TEI transformations

I can only see TEI-to-hocr in ocr-fileformat. Maybe these are still in planning/development?

There is no one way to do this, and the individual transformations will require a level of adaptation that makes an abstraction like ocr-fileformat inconvenient at some point.

Okay, so your concern is with availability/uniformity first, whereas flexibility should come later or in other wrappers. I kind of agree...

If we had the pagetopdf as part of ocr-fileformat, we could at least support the use case "1 PAGE-XML -> 1 PDF" and then merge the results. Not terribly efficient but a good baseline.

Yes, that makes sense. Since @JKamlah is going to do multi-page via gs/pdfjoin post-processing anyway, this can be separated naturally.
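A rough sketch of what that separate merge step could look like, assuming the per-page PDFs carry zero-padded names so that lexicographic order equals page order. The helper name and file names are made up; the Ghostscript options are the usual `pdfwrite` ones. The function only builds the command, so it can be inspected before handing it to `subprocess.run`.

```python
def build_merge_command(page_pdfs, output_pdf):
    """Build a Ghostscript command that concatenates single-page PDFs.

    Assumes file names sort into page order (e.g. zero-padded numbers).
    """
    ordered = sorted(page_pdfs)
    if not ordered:
        raise ValueError("no page PDFs to merge")
    return [
        "gs",
        "-dBATCH",            # exit after processing instead of prompting
        "-dNOPAUSE",          # do not pause between pages
        "-q",                 # suppress startup messages
        "-sDEVICE=pdfwrite",  # write PDF output
        f"-sOutputFile={output_pdf}",
        *ordered,
    ]
```

For example, `build_merge_command(["p_0002.pdf", "p_0001.pdf"], "doc.pdf")` puts `p_0001.pdf` before `p_0002.pdf` on the command line.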

But what about the totally non-obvious choices for the above-mentioned parameters?

hierarchy level for text: I guess we could assume the OCR-D text hierarchy consistency principle is honoured and just pick the top-most available level?
hierarchy level for outlines: This would have to become a (less-documented, non-enum) script-arg. Probably only valuable for debugging anyway.
font: We could try to be fully automatic by doing something like Tesseract's text2image --find_fonts.

How about integration into ocr-fileformat for the simple use case now and later investing time into getting ocrd_pagetopdf right with support for all the features the PRIMA tool offers?

Oh, so you want to go both ways, keeping this repo distinct.

Answer 4 · 2020-04-17T12:54:17.000Z

I can only see TEI-to-hocr in ocr-fileformat. Maybe these are still in planning/development?

At least page2tei is planned. My point was that to-TEI transformations suffer from the same page-vs-document orientation problem. We have discussed this in OCR-D and also in the DHd OCR working group and decided to focus on the simple use case first: page-wise transformation, which yields a valid but not terribly useful TEI document.

hierarchy level for text:

Yes, we assume input to be valid OCR-D-conformant PAGE, hence the top-most available level.
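A minimal sketch of "pick the top-most available level", assuming "available" means that at least one element of that level has a non-empty `TextEquiv/Unicode` as a direct child. The function name is made up, and namespaces are ignored on purpose so the sketch does not depend on one particular PAGE schema version.

```python
import xml.etree.ElementTree as ET

# PAGE text hierarchy, from top-most to lowest level.
LEVELS = ["TextRegion", "TextLine", "Word", "Glyph"]


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def has_own_text(elem):
    """True if elem has a direct TextEquiv child with non-empty Unicode."""
    for child in elem:
        if local_name(child.tag) != "TextEquiv":
            continue
        for uni in child:
            if local_name(uni.tag) == "Unicode" and (uni.text or "").strip():
                return True
    return False


def topmost_text_level(page_xml):
    """Return the highest level that carries text, or None if there is none."""
    root = ET.fromstring(page_xml)
    for level in LEVELS:
        for elem in root.iter():
            if local_name(elem.tag) == level and has_own_text(elem):
                return level
    return None


SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Hello</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Hello</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(topmost_text_level(SAMPLE))  # prints: TextLine
```

In the sample the region itself has no text, so the line level is chosen even though word-level text exists as well.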

hierarchy level for outlines

To get this right is probably beyond ocr-fileformat's capabilities.

font: We could try to be fully automatic by doing something like Tesseract's text2image --find_fonts.

Interesting, didn't know that, thanks. I cannot see how this can be implemented in an ocr-fileformat wrapper without requiring a lot of parameters (font search path etc.). @JKamlah If this can be hidden away in a script, and if we use that script as the entry point to pagetopdf, this should be doable though, right?
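As a sketch of what "hidden away in a script" might mean: a small helper that builds the `text2image --find_fonts` call with defaults, so the caller only passes a text sample. The helper name and all default values are assumptions; the options are the ones used in Tesseract's training tools, and the helper only builds the command without running it.

```python
def build_find_fonts_command(text_file,
                             fonts_dir="/usr/share/fonts",
                             min_coverage=0.9,
                             outputbase="fontscan"):
    """Build a text2image call listing fonts that can render text_file.

    All defaults are assumptions chosen for illustration.
    """
    return [
        "text2image",
        "--find_fonts",
        "--fonts_dir", fonts_dir,
        "--text", text_file,
        "--min_coverage", str(min_coverage),
        "--outputbase", outputbase,
    ]
```

The font search path then becomes a default inside the script rather than a parameter that the ocr-fileformat wrapper has to expose.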

How about integration into ocr-fileformat for the simple use case now and later investing time into getting ocrd_pagetopdf right with support for all the features the PRIMA tool offers?

Oh, so you want to go both ways, keeping this repo distinct.

ocrd_fileformat integration as the quick and dirty solution for page-wise transformation, a dedicated processor for document-oriented processing with additional parameters, support for outlines etc. We have limited resources, so focussing on the simple use cases first seems wise. Plus, ocr-fileformat is not OCR-D-specific, so others could profit from integration.