PRImA-Research-Lab/prima-page-converter

Djvu to Page-xml

Closed this issue · 3 comments

mrocr commented

@chris1010010 Would you consider supporting conversion from DjVu to PAGE XML?

I'm not an expert in DjVu, but if it's straightforward, we will consider it. What DjVu data are you working with?

mrocr commented

@chris1010010
Note: I am willing to make a modest donation or pay to have this feature implemented.

Some djvu data sources:

Statement:
Training document layout analysis and OCR models requires a lot of training data and ground truth. DjVu is mature and well developed, but it is not widely adopted among developers on GitHub.

Conclusion:

  • Converting DjVu to JPG + PAGE XML would make it possible to build a large training dataset quickly, which could then be used to train P2PaLA, dhSegment, Calamari, etc.
  • You could even use the well-established pdf2djvu to chain PDF > DjVu > PAGE XML.
    Two birds, one stone.

Waiting for your reply.

This seems to be a bigger project; I wasn't aware the images are encoded in DjVu.

Perhaps these projects can help us:
https://github.com/gthurm/javadjvu
http://djvu.sourceforge.net/
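
For the text layer only, here is a rough sketch of how the mapping might look. It assumes the hidden text has already been dumped with the `print-txt` command of `djvused` (part of DjVuLibre, the second link), which prints nested zones with bounding boxes as an s-expression. All function names are hypothetical and nothing here is part of prima-page-converter. It also assumes the `page` zone spans the whole image and that DjVu boxes are measured from the bottom-left corner, so y values are flipped for PAGE. The output is not schema-validated: there is no `Metadata` element, and all lines go into a single region.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the PAGE namespace version; adjust to the schema you target.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


def parse_sexpr(text):
    """Parse the first s-expression in `text` into nested lists."""
    stack = []
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            if not stack:
                return node
            stack[-1].append(node)
        elif tok.startswith('"'):
            # Only \" and \\ are unescaped; octal escapes are left undecoded.
            stack[-1].append(re.sub(r'\\(["\\])', r"\1", tok[1:-1]))
        elif re.fullmatch(r"-?\d+", tok):
            stack[-1].append(int(tok))
        else:
            stack[-1].append(tok)
    raise ValueError("unbalanced s-expression")


def zones(node, kind):
    """Yield every zone of type `kind` nested inside `node`, in order."""
    for child in node[5:]:
        if isinstance(child, list):
            if child[0] == kind:
                yield child
            else:
                yield from zones(child, kind)


def zone_text(node):
    """Return a zone's own string, or its children's text joined by spaces."""
    parts = [zone_text(c) if isinstance(c, list) else c for c in node[5:]]
    return " ".join(p for p in parts if p)


def points(zone, page_height):
    """Turn a DjVu box (bottom-left origin) into PAGE points (top-left origin)."""
    _, x0, y0, x1, y1 = zone[:5]
    top, bottom = page_height - y1, page_height - y0
    return f"{x0},{top} {x1},{top} {x1},{bottom} {x0},{bottom}"


def add_zone(parent, tag, zone_id, zone, page_height):
    """Append a PAGE element with an id and a Coords child."""
    el = ET.SubElement(parent, f"{{{PAGE_NS}}}{tag}", id=zone_id)
    ET.SubElement(el, f"{{{PAGE_NS}}}Coords", points=points(zone, page_height))
    return el


def add_text(parent, text):
    """Append TextEquiv/Unicode holding `text`."""
    equiv = ET.SubElement(parent, f"{{{PAGE_NS}}}TextEquiv")
    ET.SubElement(equiv, f"{{{PAGE_NS}}}Unicode").text = text


def djvu_text_to_page(sexpr, image_filename):
    """Convert one page of print-txt style output to a PAGE XML string."""
    page = parse_sexpr(sexpr)
    width, height = page[3], page[4]  # assumes the page zone starts at 0,0
    ET.register_namespace("", PAGE_NS)
    root = ET.Element(f"{{{PAGE_NS}}}PcGts")
    page_el = ET.SubElement(
        root, f"{{{PAGE_NS}}}Page", imageFilename=image_filename,
        imageWidth=str(width), imageHeight=str(height))
    region = add_zone(page_el, "TextRegion", "r0", page, height)
    for i, line in enumerate(zones(page, "line")):
        line_el = add_zone(region, "TextLine", f"l{i}", line, height)
        for j, word in enumerate(zones(line, "word")):
            word_el = add_zone(line_el, "Word", f"l{i}_w{j}", word, height)
            add_text(word_el, zone_text(word))
        add_text(line_el, zone_text(line))
    return ET.tostring(root, encoding="unicode")
```

In practice the input string would come from something like `djvused -e 'select 1; print-txt' book.djvu`, with the page image rendered separately. Real layout ground truth would still need proper regions, reading order, and schema validation on top of this.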