UB-Mannheim/ocr-fileformat

Support conversion from and to Textract JSON

scottschreckengaust opened this issue · 4 comments

Amazon Textract returns its results in a JSON format.

https://docs.aws.amazon.com/textract/latest/dg/textract-dg.pdf

Specifically, there are three types of analysis (see https://docs.aws.amazon.com/textract/latest/dg/how-it-works-analyzing.html), covering these categories:

  1. text,
  2. forms, and
  3. tables
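As a rough illustration of how these three categories show up in the JSON, here is a minimal Python sketch. It assumes the response is a dict with a top-level `Blocks` list whose entries carry a `BlockType` field, as described in the Textract documentation. The `group_blocks` helper, the category mapping, and the sample data are made up for this example and are not real Textract output.

```python
from collections import defaultdict

# Assumed mapping from Textract BlockType values to the three categories above.
# Anything not listed here (e.g. PAGE) ends up under "other".
CATEGORY_BY_BLOCK_TYPE = {
    "LINE": "text",
    "WORD": "text",
    "KEY_VALUE_SET": "forms",
    "TABLE": "tables",
    "CELL": "tables",
}


def group_blocks(response):
    """Group the blocks of a Textract-style response dict by category."""
    groups = defaultdict(list)
    for block in response.get("Blocks", []):
        category = CATEGORY_BY_BLOCK_TYPE.get(block.get("BlockType"), "other")
        groups[category].append(block)
    return dict(groups)


# Hand-made sample for illustration only.
sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Invoice 42"},
        {"BlockType": "KEY_VALUE_SET", "Id": "k1", "EntityTypes": ["KEY"]},
        {"BlockType": "TABLE", "Id": "t1"},
        {"BlockType": "CELL", "Id": "c1", "RowIndex": 1, "ColumnIndex": 1},
    ]
}

groups = group_blocks(sample)
for category in sorted(groups):
    print(category, [block["Id"] for block in groups[category]])
```

A converter would need to handle each of these groups separately, which is why text support can exist before forms and tables do.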

Conversion from Textract to PAGE XML has now been added with pull request #160.

Alas, the new converter is still incomplete, so

  • forms, and
  • tables

do not work yet. See slub/textract2page#2
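For context, the geometry part of such a conversion is fairly mechanical: Textract reports each `BoundingBox` as ratios of the page width and height, while PAGE XML `Coords` elements take a `points` string of pixel coordinates. The following is a minimal sketch of that step only. It is not taken from textract2page, and the `bbox_to_points` helper and the image size are hypothetical.

```python
def bbox_to_points(bbox, image_width, image_height):
    """Turn a Textract-style BoundingBox (ratios of the page size) into a
    PAGE-XML-style points string of pixel coordinates, clockwise from top left.
    """
    left = round(bbox["Left"] * image_width)
    top = round(bbox["Top"] * image_height)
    right = round((bbox["Left"] + bbox["Width"]) * image_width)
    bottom = round((bbox["Top"] + bbox["Height"]) * image_height)
    corners = [(left, top), (right, top), (right, bottom), (left, bottom)]
    return " ".join(f"{x},{y}" for x, y in corners)


# Illustrative values: a box on an assumed 1000 x 2000 pixel page image.
example_bbox = {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}
points = bbox_to_points(example_bbox, image_width=1000, image_height=2000)
print(points)  # 250,1000 750,1000 750,1250 250,1250
```

Forms and tables need more than this, because their structure is spread across related blocks instead of sitting in a single bounding box.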

Update: tables work now, but the converter submodule needs to be updated here

kba commented

> Update: tables work now, but the converter submodule needs to be updated here

I've updated the vendor submodules, including textract2page, in #166. The tables branch is not yet merged to master, though, and I think some files needed to run the tests properly are missing.