ibm-aur-nlp/PubLayNet

The scripts/code used to match the PDF miner outputs on documents to the XML representations

abirami005 opened this issue · 7 comments

Do you provide the scripts/code that you developed to match the PDFMiner outputs on the documents to the XML representation of the PDF page itself? Thanks

zhxgj commented

We cannot open source the code at the moment as it is related to our IP protection.

Then how about publishing the alignment data themselves in some form?

zhxgj commented

Hmm, I had not thought of that before. Let me check with our legal approval chain.

I assume this means that providing only the code for extracting annotations from XML representation is also not possible at the moment?

zhxgj commented

@pollyMath Unfortunately that is what our IP lawyer told us.

@zhxgj Did your lawyers reach a verdict regarding the publication of PDF/XML alignment data?

Note: This is relevant to a number of potential applications of this corpus, for which some choices made in the COCO format would be incompatible or suboptimal, e.g.

  • definition/granularity of region classes
  • not annotating headers and footers
  • not including reading order of regions
  • not including text lines (contours / baselines)
  • not including text content (plain) and text style (formatting)
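
For anyone who wants to check these points against a COCO-style annotation file, here is a minimal sketch. The miniature `sample` dictionary, the helper names, and the "wanted" field names (`reading_order`, `text`, `text_lines`, `style`) are all made up for illustration; only the top-level `images` / `categories` / `annotations` layout and the five region class names are taken from the released format.

```python
from collections import Counter

# Hypothetical miniature of a COCO-style annotation file. A real file would
# be read with json.load() from whatever path the annotations were saved to.
sample = {
    "images": [{"id": 1, "file_name": "page_0001.jpg", "width": 612, "height": 792}],
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 3, "name": "list"},
        {"id": 4, "name": "table"},
        {"id": 5, "name": "figure"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 2, "bbox": [50, 40, 500, 30],
         "area": 15000, "iscrowd": 0,
         "segmentation": [[50, 40, 550, 40, 550, 70, 50, 70]]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [50, 90, 500, 300],
         "area": 150000, "iscrowd": 0,
         "segmentation": [[50, 90, 550, 90, 550, 390, 50, 390]]},
    ],
}

# Illustrative names for the information listed above; not part of any schema.
WANTED_FIELDS = {"reading_order", "text", "text_lines", "style"}


def summarize(coco):
    """Return (regions per class name, set of keys used by any annotation)."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    class_counts = Counter(names[a["category_id"]] for a in coco["annotations"])
    keys = set().union(*(a.keys() for a in coco["annotations"]))
    return class_counts, keys


def missing_fields(coco, wanted=WANTED_FIELDS):
    """List the wanted per-region fields that no annotation provides."""
    _, keys = summarize(coco)
    return sorted(wanted - keys)


if __name__ == "__main__":
    counts, keys = summarize(sample)
    print("regions per class:", dict(counts))
    print("annotation keys:", sorted(keys))
    print("not available:", missing_fields(sample))
```

Running the same two helpers on a full annotation file would show which region classes occur and which per-region keys are present, which is usually enough to tell whether a downstream task needs the PDF/XML alignment data instead.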