Djvu to Page-xml
@chris1010010 Would you consider supporting conversion from DjVu to PAGE XML?
I'm no expert in DjVu, but if it's straightforward, we will consider it. What DjVu data are you working with?
@chris1010010
Note: I am willing to make a modest donation, or to pay, to have this feature implemented.
Some djvu data sources:
- The Internet Archive contains a vast library of djvu content.
- The Library of Alexandria contains story books in djvu.
- the list goes on
Statement:
Training document layout analysis and OCR models requires a lot of training data and ground truth. DjVu is mature and well developed, but somehow it is not widely adopted among GitHub developers.
Conclusion:
- Converting DjVu to JPG + PAGE XML would make it possible to build a large training dataset quickly, which could then be used to train P2PaLA, dhSegment, Calamari, etc.
- You could even use the well-established pdf2djvu to convert PDF > DjVu > PAGE XML.
2 birds, 1 stone
Waiting for your reply.
This seems to be a bigger project; I wasn't aware that the images are encoded in DjVu.
Perhaps these projects can help us:
https://github.com/gthurm/javadjvu
http://djvu.sourceforge.net/
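
For what it's worth, here is a rough Python sketch of the text-layer half of such a conversion. It assumes the hidden text has already been dumped as an S-expression with DjVuLibre's `djvused` (for example `djvused book.djvu -e 'select 1; print-txt'`) and that the page image was rendered separately at full resolution. The function names (`parse_sexpr`, `to_page_xml`) are made up for illustration, the whole page is treated as a single text region, octal escapes in strings are not decoded, and the output omits the `Metadata` element, so it would probably not validate against the PAGE schema as is.

```python
import re
import xml.etree.ElementTree as ET

# PAGE namespace (2019-07-15 schema version; adjust if you target another one).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


def parse_sexpr(text):
    """Parse one S-expression into nested lists; strings become ('str', value)."""
    stack, root = [], None
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            if stack:
                stack[-1].append(node)
            else:
                root = node
        elif tok.startswith('"'):
            # Simplification: only \" is unescaped, octal escapes are left as-is.
            stack[-1].append(("str", tok[1:-1].replace('\\"', '"')))
        else:
            stack[-1].append(tok)
    return root


def node_text(node):
    """Join the text of a zone and its child zones with spaces."""
    parts = []
    for child in node[5:]:
        parts.append(child[1] if isinstance(child, tuple) else node_text(child))
    return " ".join(p for p in parts if p)


def find_lines(node):
    """Yield every 'line' zone below the given zone."""
    if node[0] == "line":
        yield node
    else:
        for child in node[5:]:
            if isinstance(child, list):
                yield from find_lines(child)


def box_points(x0, y0, x1, y1, height):
    """DjVu measures y from the bottom edge, PAGE from the top, so flip y."""
    top, bottom = height - y1, height - y0
    return f"{x0},{top} {x1},{top} {x1},{bottom} {x0},{bottom}"


def to_page_xml(sexpr_text, image_filename):
    """Build a minimal PAGE XML string from a djvused 'print-txt' dump."""
    page = parse_sexpr(sexpr_text)
    # Assumption: the page zone is (page 0 0 width height ...).
    width, height = int(page[3]), int(page[4])
    ET.register_namespace("", PAGE_NS)
    q = lambda name: f"{{{PAGE_NS}}}{name}"
    root = ET.Element(q("PcGts"))
    page_el = ET.SubElement(root, q("Page"), imageFilename=image_filename,
                            imageWidth=str(width), imageHeight=str(height))
    region = ET.SubElement(page_el, q("TextRegion"), id="r1")
    ET.SubElement(region, q("Coords"),
                  points=box_points(0, 0, width, height, height))
    for i, line in enumerate(find_lines(page), start=1):
        x0, y0, x1, y1 = (int(v) for v in line[1:5])
        line_el = ET.SubElement(region, q("TextLine"), id=f"r1_l{i}")
        ET.SubElement(line_el, q("Coords"),
                      points=box_points(x0, y0, x1, y1, height))
        equiv = ET.SubElement(line_el, q("TextEquiv"))
        ET.SubElement(equiv, q("Unicode")).text = node_text(line)
    return ET.tostring(root, encoding="unicode")
```

The page images themselves could be rendered with DjVuLibre's `ddjvu` (for example `ddjvu -format=tiff -page=1 book.djvu page1.tiff`) and then converted to JPG with any image tool, since as far as I know `ddjvu` does not write JPG directly. Whether text extracted this way is good enough to count as ground truth depends on how the DjVu files were produced.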