PRImA-Research-Lab/prima-page-converter

PDF to Page-xml

mrocr opened this issue · 7 comments

mrocr commented

Would you consider adding the ability to convert a PDF file to a Page-xml file?

A conversion would not be easy, and at the moment we don't have the resources for it. The only possible way right now is to convert the PDF to images and run TesseractToPage (available on our website) to OCR them.

mrocr commented

Thanks for your consideration

That's the opposite ;-)

kba commented

That's the opposite ;-)

True, but wouldn't this be a possible workflow:

pdftoimages
[preprocessing]
ocrd-tesserocr-recognize
prima-page-to-pdf to sandwich text and results?
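
For illustration only, the first steps of that workflow could be scripted roughly as below. This is a minimal sketch under several assumptions that are not from this thread: Poppler's `pdftoppm` stands in for "pdftoimages", the page images have already been registered in an OCR-D workspace, the file group names `OCR-D-IMG` and `OCR-D-OCR` and the model name `eng` are placeholders, and the helper function names are made up. The preprocessing and prima-page-to-pdf steps are left out because their exact invocations are not given here.

```python
import shlex
from pathlib import Path


def pdf_to_images_cmd(pdf_path, out_dir, dpi=300):
    """Build a pdftoppm (Poppler) command that renders each PDF page as a PNG.

    pdftoppm is an assumed stand-in for the "pdftoimages" step.
    """
    prefix = Path(out_dir) / Path(pdf_path).stem  # output file name prefix
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def ocr_cmd(input_grp="OCR-D-IMG", output_grp="OCR-D-OCR", model="eng"):
    """Build an ocrd-tesserocr-recognize call using the usual OCR-D options.

    The file group names and the model are placeholders. The command has to
    run inside an OCR-D workspace that already contains the page images.
    """
    return ["ocrd-tesserocr-recognize", "-I", input_grp, "-O", output_grp,
            "-P", "model", model]


def plan(pdf_path, out_dir):
    """Return the commands in order, without executing anything."""
    return [pdf_to_images_cmd(pdf_path, out_dir), ocr_cmd()]


if __name__ == "__main__":
    # Print the plan so it can be reviewed before running it by hand.
    for cmd in plan("book.pdf", "pages"):
        print(shlex.join(cmd))
```

Building the commands as lists and only printing them keeps the sketch side-effect free; each list could be passed to `subprocess.run(cmd, check=True)` once the tools are installed and the workspace is set up.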

Really curious, haven't had the time yet to deal with this, but it's certainly a desired feature for many users. Many libraries also offer a bulk PDF download which is easier to scrape than the mets.xml (if users even know about that option).

Yes, that's possible of course (see my entry from May). But you throw away a lot of information (the text), and the results will only be as good as Tesseract's OCR.

BobLd commented

Maybe have a look at the PageXmlTextExporter class in PdfPig (in C#). See the wiki for more info. It's still an early version but might help...