aws-samples/amazon-textract-response-parser

Support Layout API responses in TRP.js

athewsey opened this issue · 4 comments

Amazon Textract has launched a new LAYOUT analysis capable of returning a range of layout features such as titles, paragraphs, headers, and footers.

Today TRP.js provides basic heuristic options for sorting text (paragraphs) into reading order and segmenting headers and footers from main content.

We should extend TRP.js to make use of Amazon Textract's native layout analysis where possible, and maybe(?) keep these old heuristic methods around in case users want to continue using them to save on API costs or re-ingestion.
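To make the idea concrete, here is a rough sketch of "use native layout where present, otherwise fall back to a heuristic". It is not TRP.js code: `linesInReadingOrder` and `heuristicLineOrder` are hypothetical names, the fallback is a deliberately naive top-to-bottom sort rather than the existing TRP.js heuristics, and it assumes `LAYOUT_*` blocks arrive in reading order and link to their `LINE` blocks through `CHILD` relationships.

```typescript
// Minimal shapes for the parts of a Textract response used in this sketch.
interface Block {
  Id: string;
  BlockType: string;
  Page?: number;
  Text?: string;
  Geometry?: { BoundingBox: { Top: number; Left: number; Width: number; Height: number } };
  Relationships?: { Type: string; Ids: string[] }[];
}

const isLayout = (b: Block): boolean => b.BlockType.startsWith("LAYOUT_");

// Hypothetical fallback (NOT the TRP.js heuristic): page, then top, then left.
function heuristicLineOrder(blocks: Block[]): Block[] {
  return blocks
    .filter((b) => b.BlockType === "LINE")
    .sort(
      (a, b) =>
        (a.Page ?? 1) - (b.Page ?? 1) ||
        (a.Geometry?.BoundingBox.Top ?? 0) - (b.Geometry?.BoundingBox.Top ?? 0) ||
        (a.Geometry?.BoundingBox.Left ?? 0) - (b.Geometry?.BoundingBox.Left ?? 0)
    );
}

// Hypothetical helper: LINE blocks in layout order when LAYOUT_* blocks exist.
function linesInReadingOrder(blocks: Block[]): Block[] {
  if (!blocks.some(isLayout)) return heuristicLineOrder(blocks);

  const byId = new Map<string, Block>();
  for (const b of blocks) byId.set(b.Id, b);

  // Ids referenced as children of a layout block (e.g. LAYOUT_TEXT inside a
  // LAYOUT_LIST), so those blocks are not visited twice from the top level.
  const nested = new Set<string>();
  for (const b of blocks) {
    if (!isLayout(b)) continue;
    for (const rel of b.Relationships ?? []) {
      if (rel.Type === "CHILD") rel.Ids.forEach((id) => nested.add(id));
    }
  }

  const out: Block[] = [];
  const visit = (block: Block): void => {
    for (const rel of block.Relationships ?? []) {
      if (rel.Type !== "CHILD") continue;
      for (const id of rel.Ids) {
        const child = byId.get(id);
        if (!child) continue;
        if (child.BlockType === "LINE") out.push(child);
        else if (isLayout(child)) visit(child);
      }
    }
  };

  // Assumption: top-level layout blocks appear in reading order in the response.
  for (const b of blocks) {
    if (isLayout(b) && !nested.has(b.Id)) visit(b);
  }
  return out;
}
```

Something like `linesInReadingOrder(response.Blocks).map((l) => l.Text)` would then give the text in layout order for LAYOUT-enabled responses, while older responses without `LAYOUT_*` blocks would still go through the heuristic path.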

Not sure when I'll have a chance to look at this yet, but raising here to reflect that it's on our radar. If you're waiting on this feature or have particular feedback on how you'd like it to work in the JS/TS version of TRP, please do let us know!

pags commented

I'd love the ability to take the raw JSON from the Textract API response provided by LAYOUT and turn it into an ordered CSV like the one returned from the bulk document processor using the Textract console. As far as I can tell, using the console is the only way to get the stitched-together layout data.
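For illustration, this is roughly the kind of conversion I have in mind, as a sketch only. `layoutToCsv` is a hypothetical function, not an existing TRP.js API, and the column names are my own guess rather than the exact format the console produces. It assumes each `LAYOUT_*` block lists its `LINE` blocks under a `CHILD` relationship and that the blocks appear in reading order in the response.

```typescript
// Minimal shapes for the parts of a Textract response used in this sketch.
interface Block {
  Id: string;
  BlockType: string;
  Page?: number;
  Text?: string;
  Confidence?: number;
  Relationships?: { Type: string; Ids: string[] }[];
}

// Quote a CSV field when it contains a comma, quote, or line break.
function csvEscape(value: string | number): string {
  const s = String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

// Hypothetical converter: one CSV row per LAYOUT_* block, in response order.
function layoutToCsv(response: { Blocks: Block[] }): string {
  const byId = new Map<string, Block>();
  for (const b of response.Blocks) byId.set(b.Id, b);

  // Join the text of a layout block's direct LINE children. Nested layout
  // children (e.g. LAYOUT_TEXT inside LAYOUT_LIST) are skipped here because
  // they get their own rows, so a list row ends up with empty text.
  const directText = (block: Block): string => {
    const parts: string[] = [];
    for (const rel of block.Relationships ?? []) {
      if (rel.Type !== "CHILD") continue;
      for (const id of rel.Ids) {
        const child = byId.get(id);
        if (child && child.BlockType === "LINE" && child.Text) parts.push(child.Text);
      }
    }
    return parts.join(" ");
  };

  // Assumed column names, not the console's actual header row.
  const rows: string[] = [["Page", "ReadingOrder", "Layout", "Text", "Confidence"].join(",")];
  const orderOnPage = new Map<number, number>();

  for (const block of response.Blocks) {
    if (!block.BlockType.startsWith("LAYOUT_")) continue;
    const page = block.Page ?? 1; // Page may be absent on single-page responses
    const order = (orderOnPage.get(page) ?? 0) + 1;
    orderOnPage.set(page, order);
    rows.push(
      [page, order, block.BlockType, directText(block), block.Confidence ?? ""]
        .map(csvEscape)
        .join(",")
    );
  }
  return rows.join("\n");
}
```

Feeding it the parsed JSON, e.g. `layoutToCsv(JSON.parse(rawResponseText))`, would yield one row per layout element with a per-page reading-order index.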

This feature would be hugely beneficial to us, as we are building out functionality to convert PDFs into our forms using Textract and need the LAYOUT analysis to interpret how inputs relate to each other.