Semi-automatic structuring in metadata editor using OCR
In the Kitodo.Production repository, a detailed ticket for integrating automatic structuring was created.
@markusweigelt The linked feature sounds great. But I think it would be very useful to have the following functionality, which is in a way also the basis of the quite fancy functionality described there:
- show the recognized OCR as plain text inside of Kitodo (by parsing the ALTO and filtering out the XML tags)
- enable the OCR of a single page or multiple pages from Kitodo (from the metadata editor)
- give an indication in the Kitodo editor which pages are OCR processed and which are not
The main use case for me would be to allow the OCR to be done at the beginning of a workflow, even before people have done some quality assurance (missing pages etc.), so that the OCR does not have to wait, and to allow people to use the OCR results while structuring. And if people then make corrections in Kitodo, enable the OCR only for newly added pages, for example.
I am not quite sure whether those features could be addressed in the KITODO-OCRD project or whether they are something for the Kitodo development fund. What do you think?
Most of the things which kitodo/kitodo-production#5476 describes are new Kitodo UI features – out of scope for our OCR-D integration project, so yes, that would mean Kitodo development fund.
- show the recognized OCR as plain text inside of Kitodo (by parsing the ALTO and filtering out the XML tags)
What we can do here is preview OCR results with the OCR-D browser.
On the Kitodo side, for the intended extension, I think you're right – a simple plain text editor would suffice (one line per `TextLine`, with all its `./String/@CONTENT` concatenated).
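To make the plain-text idea concrete, here is a minimal Python sketch of that extraction. It is only an illustration, not Kitodo or ocrd_kitodo code: the function name `alto_to_plain_text` is made up, joining the words with a single space is an assumption, and hyphenation (`HYP`) elements are ignored. Tags are matched by local name because the ALTO namespace differs between schema versions.

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Strip an XML namespace prefix like '{uri}TextLine' down to 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def alto_to_plain_text(alto_xml: str) -> str:
    """Return one text line per ALTO TextLine (hypothetical helper).

    Each line is built from the CONTENT attributes of the TextLine's
    direct String children, i.e. ./String/@CONTENT.
    """
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if _local(elem.tag) != "TextLine":
            continue
        words = [
            child.get("CONTENT", "")
            for child in elem
            if _local(child.tag) == "String"
        ]
        # Assumption: words are separated by a single space.
        lines.append(" ".join(words))
    return "\n".join(lines)
```

Calling `alto_to_plain_text(Path("FULLTEXT/some_page.xml").read_text(encoding="utf-8"))` (with a placeholder file name) would then give the text that such an editor could display.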
- enable the OCR of a single page or multiple pages from Kitodo (from the metadata editor)
Already possible (see the `--page-id` option for `for_production.sh` and `for_presentation.sh`). The syntax is explained here (notice multi-value / range / regex support).
- give an indication in the Kitodo editor which pages are OCR processed and which are not
That's also something we (as integration project) have little control over, since it's a genuine UI feature. All we can do is ensure the filesystem side (`FULLTEXT` subdirectory and file names) fits Kitodo's conventions.
The main use case for me would be to allow the OCR to be done at the beginning of a workflow, even before people have done some quality assurance (missing pages etc.), so that the OCR does not have to wait, and to allow people to use the OCR results while structuring. And if people then make corrections in Kitodo, enable the OCR only for newly added pages, for example.
Yes, these are valid use-cases, too. But renaming pages adds the difficulty of ensuring consistency (as long as OCR is still running). I'll try to reformulate under kitodo/kitodo-production#5476.
For ocrd_kitodo IMO we can already close (as it's already supported from our side).
Except perhaps the feature that we should skip pages which have already been processed earlier (an ALTO file exists).
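Such a skip check could look roughly like the following Python sketch. It is not part of ocrd_kitodo: the helper name `pages_without_ocr` is hypothetical, and it assumes that each processed page has an ALTO file named `<page id>.xml` directly in the `FULLTEXT` subdirectory, which may not match the actual naming convention.

```python
from pathlib import Path


def pages_without_ocr(page_ids, fulltext_dir, suffix=".xml"):
    """Return the page IDs that have no ALTO file yet (hypothetical helper).

    Assumption: a processed page is stored as <page_id><suffix> inside
    fulltext_dir. The order of page_ids is preserved.
    """
    fulltext_dir = Path(fulltext_dir)
    if not fulltext_dir.is_dir():
        # No FULLTEXT directory yet, so nothing has been processed.
        return list(page_ids)
    done = {path.stem for path in fulltext_dir.glob(f"*{suffix}")}
    return [page_id for page_id in page_ids if page_id not in done]
```

The resulting list could then be passed on as the multi-value argument of `--page-id`, so that only newly added pages are processed. The same check would also tell a UI which pages already have OCR results.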
Great, thanks for your detailed answer!