PDF to Page-xml
mrocr opened this issue · 7 comments
Would you consider adding the ability to convert a PDF file to a Page-xml file?
A conversion would not be easy, and at the moment we don't have the resources for it. The only possible way is to convert the PDF to images and run TesseractToPage (available on our website) to OCR them.
Thanks for your consideration
That's the opposite ;-)
True, but wouldn't this be a possible workflow:
pdftoimages
[preprocessing]
ocrd-tesserocr-recognize
prima-page-to-pdf to sandwich text and results?
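The steps above could be sketched roughly as follows. This is only an illustration under several assumptions, not a tested pipeline: `pdftoppm` (from poppler) stands in for "pdftoimages", the OCR-D step assumes a workspace whose METS already lists the page images in a file group called `OCR-D-IMG`, and the jar name `PageToPdf.jar` with its `-xml`, `-image` and `-output` options should be checked against the prima-page-to-pdf documentation. The function name `plan_workflow`, the file names and the `eng` model are hypothetical. The script only builds and prints the commands; it does not run them.

```python
import shlex
from pathlib import Path


def plan_workflow(pdf_path, workdir, page_stem="page-1", model="eng"):
    """Build the commands for one page as argument lists; nothing is executed."""
    work = Path(workdir)
    image = work / f"{page_stem}.png"                    # hypothetical image name
    page_xml = work / "OCR-D-OCR" / f"{page_stem}.xml"   # hypothetical PAGE output
    sandwich = work / f"{page_stem}.pdf"
    return [
        # 1. Rasterize the PDF at 300 dpi; pdftoppm numbers the output files itself.
        ["pdftoppm", "-r", "300", "-png", str(pdf_path), str(work / "page")],
        # 2. [preprocessing] is optional and left out of this sketch.
        # 3. OCR inside an OCR-D workspace (file group names are assumptions).
        ["ocrd-tesserocr-recognize", "-I", "OCR-D-IMG", "-O", "OCR-D-OCR",
         "-P", "model", model],
        # 4. Combine image and PAGE text into a PDF (jar name and flags assumed).
        ["java", "-jar", "PageToPdf.jar", "-xml", str(page_xml),
         "-image", str(image), "-output", str(sandwich)],
    ]


if __name__ == "__main__":
    for command in plan_workflow("book.pdf", "work"):
        print(shlex.join(command))
```

Printing the commands first makes it easy to check each one against the respective tool's help output before running anything on a real scan.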
Really curious, haven't had the time yet to deal with this, but it's certainly a desired feature for many users. Many libraries also offer a bulk PDF download which is easier to scrape than the mets.xml (if users even know about that option).
Yes, that's possible of course (see my entry from May). But you throw away a lot of information (the text already in the PDF), and the results will only be as good as Tesseract.