The scripts/code used to match the PDFMiner outputs on documents to the XML representations

Question

abirami005 opened this issue 5 years ago · 7 comments

Do you provide the scripts/code that you developed to match the PDFMiner outputs on the documents to the XML representation of the PDF page itself? Thanks

Answer 1 · 2020-02-27T21:42:17.000Z

We cannot open source the code at the moment as it is related to our IP protection.

Answer 2 · 2020-03-02T08:26:21.000Z

> We cannot open source the code at the moment as it is related to our IP protection.

Then how about publishing the alignment data themselves in some form?

Answer 3 · 2020-03-02T21:42:53.000Z

> > We cannot open source the code at the moment as it is related to our IP protection.
>
> Then how about publishing the alignment data themselves in some form?

Em, I did not think of it before. Let me have a check along our legal approval chain.

Answer 4 · 2020-03-05T11:48:22.000Z

I assume this means that providing only the code for extracting annotations from the XML representation is also not possible at the moment?

Answer 5 · 2020-03-05T23:48:59.000Z

@pollyMath Unfortunately that is what our IP lawyer told us.

Answer 6 · 2021-01-11T16:48:44.000Z

> > > We cannot open source the code at the moment as it is related to our IP protection.
> >
> > Then how about publishing the alignment data themselves in some form?
>
> Em, I did not think of it before. Let me have a check along our legal approval chain.

@zhxgj Did your lawyers reach a verdict regarding the publication of PDF/XML alignment data?

Note: This is relevant to a number of potential applications of this corpus, for which some choices made in the COCO format would be incompatible or suboptimal, e.g.

- definition/granularity of region classes
- not annotating headers and footers
- not including reading order of regions
- not including text lines (contours / baselines)
- not including text content (plain) and text style (formatting)

Answer 7 · 2021-01-12T23:05:46.000Z

Unfortunately not yet. I understand the benefits, but we cannot release it yet. Thanks for your understanding.
