PDF to Page-xml

Question

mrocr opened this issue 6 years ago · 7 comments

Would you consider adding the ability to convert a PDF file to a Page-xml file?

Answer 1 · 2019-05-18T10:31:45.000Z

A conversion would not be easy, and at the moment we don't have the resources for it. The only possible way right now is to convert the PDF to images and run TesseractToPage (available on our website) to OCR them.

Answer 2 · 2019-05-18T13:42:40.000Z

Thanks for your consideration

Answer 3 · 2020-01-28T12:15:50.000Z

https://github.com/PRImA-Research-Lab/prima-page-to-pdf ?

Answer 4 · 2020-01-28T12:49:59.000Z

That's the opposite ;-)

Answer 5 · 2020-01-28T13:52:47.000Z

> That's the opposite ;-)

True, but wouldn't this be a possible workflow:

pdftoimages
[preprocessing]
ocrd-tesserocr-recognize
prima-page-to-pdf to sandwich text and results?
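
The first steps of that workflow could be sketched roughly as below. This is only an illustration under assumptions: it uses Poppler's `pdftoppm` for the "pdftoimages" step and the plain `tesseract` command line writing hOCR as a stand-in for `ocrd-tesserocr-recognize` (which expects an OCR-D workspace and writes PAGE XML). The function names are hypothetical, and preprocessing and the final prima-page-to-pdf step are left out.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf: Path, out_prefix: Path, dpi: int = 300) -> list[str]:
    """Build the Poppler command that renders every PDF page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def tesseract_command(image: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """Build a Tesseract command that writes hOCR next to out_base.

    Assumption: hOCR is used here only as a stand-in OCR output format.
    """
    return ["tesseract", str(image), str(out_base), "-l", lang, "hocr"]


def rasterize_and_ocr(pdf: Path, workdir: Path) -> list[Path]:
    """Render the PDF to page images, OCR each one, return the hOCR paths.

    Requires pdftoppm and tesseract to be installed and on PATH.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf, workdir / "page"), check=True)

    outputs = []
    # pdftoppm names its output page-1.png, page-2.png, ... (possibly zero-padded)
    for image in sorted(workdir.glob("page-*.png")):
        out_base = image.with_suffix("")  # e.g. workdir/page-1
        subprocess.run(tesseract_command(image, out_base), check=True)
        outputs.append(out_base.with_suffix(".hocr"))
    return outputs
```

Calling `rasterize_and_ocr(Path("scan.pdf"), Path("work"))` would leave one image and one OCR result per page in `work/`. Any text layer already embedded in the PDF is ignored on this route.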

Really curious, haven't had the time yet to deal with this, but it's certainly a desired feature for many users. Many libraries also offer a bulk PDF download which is easier to scrape than the mets.xml (if users even know about that option).

Answer 6 · 2020-01-28T17:11:55.000Z

Yes, that's possible of course (see my entry from May). But you throw away a lot of information (the text already embedded in the PDF), and the results will only be as good as Tesseract's OCR.

Answer 7 · 2020-04-07T13:43:17.000Z

Maybe have a look at the PageXmlTextExporter class in PdfPig (in C#). See the wiki for more info. It's still an early version but might help...
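
To illustrate the general idea of exporting already-extracted text rather than re-running OCR, here is a minimal Python sketch that wraps words and their bounding boxes in a PAGE-style XML document. It is not PdfPig's implementation. The `Word` class and function names are hypothetical, the words are assumed to come from some PDF text extractor with coordinates already converted to image pixels, everything is placed in a single region and line, and the required `Metadata` element is omitted, so the output has not been validated against the PAGE schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


@dataclass
class Word:
    """Hypothetical container: word text plus its pixel bounding box."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


def box_points(x0: int, y0: int, x1: int, y1: int) -> str:
    """Format a rectangle as a PAGE 'points' polygon (clockwise from top-left)."""
    return f"{x0},{y0} {x1},{y0} {x1},{y1} {x0},{y1}"


def words_to_page_xml(words: list[Word], image_filename: str,
                      width: int, height: int) -> str:
    """Put all words into one TextRegion / TextLine and serialize to XML."""
    if not words:
        raise ValueError("at least one word is required")
    ET.register_namespace("", PAGE_NS)

    def q(tag: str) -> str:
        return f"{{{PAGE_NS}}}{tag}"

    root = ET.Element(q("PcGts"))
    page = ET.SubElement(root, q("Page"), {
        "imageFilename": image_filename,
        "imageWidth": str(width),
        "imageHeight": str(height),
    })

    # The region and line outline is the union of all word boxes.
    outline = box_points(min(w.x0 for w in words), min(w.y0 for w in words),
                         max(w.x1 for w in words), max(w.y1 for w in words))
    region = ET.SubElement(page, q("TextRegion"), {"id": "r1"})
    ET.SubElement(region, q("Coords"), {"points": outline})
    line = ET.SubElement(region, q("TextLine"), {"id": "l1"})
    ET.SubElement(line, q("Coords"), {"points": outline})

    for i, w in enumerate(words, start=1):
        word_el = ET.SubElement(line, q("Word"), {"id": f"w{i}"})
        ET.SubElement(word_el, q("Coords"),
                      {"points": box_points(w.x0, w.y0, w.x1, w.y1)})
        equiv = ET.SubElement(word_el, q("TextEquiv"))
        ET.SubElement(equiv, q("Unicode")).text = w.text

    # Line-level text follows the words.
    line_equiv = ET.SubElement(line, q("TextEquiv"))
    ET.SubElement(line_equiv, q("Unicode")).text = " ".join(w.text for w in words)
    return ET.tostring(root, encoding="unicode")
```

A real exporter would also need layout analysis to group words into lines and regions, which is the hard part of such a conversion.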