Add as transform script to ocr-fileformat?
kba opened this issue · 4 comments
Wouldn't this be more versatile if it were integrated into ocr-fileformat / ocrd_fileformat ?
That's difficult to do with the various parameters offered here, though:
- hierarchy level for text
- hierarchy level for outlines
- font
These arise naturally when downgrading from an annotation to a presentation format, so the other converters don't share them.
If we flatten these into ocrd_fileformat's script-args, we lose all type checking and descriptions. (And if we keep them as top-level params, they will look ugly/ambiguous on the other conversions.)
Moreover, and even more difficult, as soon as this supports multi-page output (which would be fantastic), then the whole interface behaviour looks much different than ocrd_fileformat's.
Related: OCR-D/spec#142
It's similar to the {PAGE,ALTO,hOCR}-to-TEI transformations in that one format is page-based and the other document-based. There is no one way to do this and the individual transformations will require a level of adaptions that make a level of abstraction like ocr-fileformat inconvenient at some point. That's why ocr-fileformat has such a simple interface: for advanced use cases, it is much easier to just take the XSLT/script and adapt it.
If we had the pagetopdf as part of ocr-fileformat, we could at least support the use case "1 PAGE-XML -> 1 PDF" and then merge the results. Not terribly efficient but a good baseline.
How about integration into ocr-fileformat for the simple use case now and later investing time into getting ocrd_pagetopdf right with support for all the features the PRIMA tool offers?
It's similar to the {PAGE,ALTO,hOCR}-to-TEI transformations
I can only see TEI-to-hocr in ocr-fileformat. Maybe these are still in planning/development?
There is no one way to do this and the individual transformations will require a level of adaptions that make a level of abstraction like ocr-fileformat inconvenient at some point.
Okay, so your concern is with availability/uniformity first, whereas flexibility should come later or in other wrappers. I kind of agree...
If we had the pagetopdf as part of ocr-fileformat, we could at least support the use case "1 PAGE-XML -> 1 PDF" and then merge the results. Not terribly efficient but a good baseline.
Yes, that makes sense. Since @JKamlah is going to do multi-page via gs/pdfjoin post-processing anyway, this can be separated naturally.
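For illustration, the "1 PAGE-XML -> 1 PDF, then merge" baseline could be sketched roughly as below. This is only a sketch under assumptions, not what ocrd_pagetopdf or ocr-fileformat actually does: the function names and file names are hypothetical, and it assumes Ghostscript is installed as gs and that the per-page PDFs already exist.

```python
import subprocess


def build_gs_merge_command(page_pdfs, output_pdf):
    """Build a Ghostscript command that concatenates per-page PDFs.

    Hypothetical helper for illustration; the page order is the list order.
    """
    if not page_pdfs:
        raise ValueError("need at least one per-page PDF to merge")
    return [
        "gs",
        "-q",                          # suppress startup messages
        "-dBATCH",                     # exit after processing all inputs
        "-dNOPAUSE",                   # do not prompt between pages
        "-sDEVICE=pdfwrite",           # write PDF output
        f"-sOutputFile={output_pdf}",  # merged multi-page result
        *page_pdfs,
    ]


def merge_page_pdfs(page_pdfs, output_pdf):
    """Run the merge; assumes the 'gs' executable is on PATH."""
    subprocess.run(build_gs_merge_command(page_pdfs, output_pdf), check=True)
```

Keeping the merge as a separate post-processing step means the page-wise converter itself never needs to know about multi-page output.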
But what about the totally non-obvious choices for the above mentioned parameters?
- hierarchy level for text: I guess we could assume the OCR-D text hierarchy consistency principle is honoured and just pick the top-most available level?
- hierarchy level for outlines: This would have to become a (less-documented, non-enum) script-arg. Probably only valuable for debugging anyway.
- font: We could try to be fully automatic by doing something like Tesseract's text2image --find_fonts.
How about integration into ocr-fileformat for the simple use case now and later investing time into getting ocrd_pagetopdf right with support for all the features the PRIMA tool offers?
Oh, so you want to go both ways, keeping this repo distinct.
I can only see TEI-to-hocr in ocr-fileformat. Maybe these are still in planning/development?
At least page2tei is planned. My point was that to-TEI transformations suffer from the same page-vs-document orientation problem. We have discussed this in OCR-D and also for the DHd OCR working group and decided to focus on the simple use case first: page-wise transformation, which yields a valid but not terribly useful TEI document.
hierarchy level for text:
Yes, we assume input to be valid OCR-D-conformant PAGE, hence the top-most available level.
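As a rough sketch of what "pick the top-most available level" could mean in practice: the snippet below walks a parsed PAGE-XML tree and returns the highest hierarchy level at which any element carries its own non-empty TextEquiv/Unicode. The function names and the returned labels are hypothetical, and tags are matched by local name so that the PAGE namespace version does not matter.

```python
import xml.etree.ElementTree as ET

# PAGE text hierarchy from coarsest to finest (labels are illustrative).
LEVELS = [
    ("region", "TextRegion"),
    ("line", "TextLine"),
    ("word", "Word"),
    ("glyph", "Glyph"),
]


def local_name(tag):
    """Strip the '{namespace}' prefix; non-string tags (comments) yield ''."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def has_own_text(elem):
    """True if elem has a direct TextEquiv child with non-empty Unicode."""
    for child in elem:
        if local_name(child.tag) != "TextEquiv":
            continue
        for uni in child:
            if local_name(uni.tag) == "Unicode" and (uni.text or "").strip():
                return True
    return False


def topmost_text_level(root):
    """Return the coarsest level that has text annotated, or None."""
    for label, tag in LEVELS:
        for elem in root.iter():
            if local_name(elem.tag) == tag and has_own_text(elem):
                return label
    return None
```

This only makes sense under the assumption stated above, namely that the input honours the text hierarchy consistency principle, so the coarser level's text agrees with the finer levels below it.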
hierarchy level for outlines
To get this right is probably beyond ocr-fileformat's capabilities.
font: We could try to be fully automatic by doing something like Tesseract's text2image --find_fonts.
Interesting, I didn't know that, thanks. I cannot see how this can be implemented in an ocr-fileformat wrapper without requiring a lot of parameters (font search path etc.). @JKamlah If this can be hidden away in a script, and if we use that script as the entry point to pagetopdf, this should be doable though, right?
How about integration into ocr-fileformat for the simple use case now and later investing time into getting ocrd_pagetopdf right with support for all the features the PRIMA tool offers?
Oh, so you want to go both ways, keeping this repo distinct.
ocrd_fileformat integration as the quick and dirty solution for page-wise transformation, and a dedicated processor for document-oriented processing with additional parameters, support for outlines etc. We have limited resources, so focussing on the simple use cases first seems wise. Plus, ocr-fileformat is not OCR-D-specific, so others could profit from integration.