MaxHalford/maxhalford.github.io

blog/textract-table-to-pandas/


Converting Amazon Textract tables to pandas DataFrames - Max Halford

I’m currently doing a lot of document processing at work. One of my tasks is to extract tables from PDF files. I evaluated Amazon Textract’s table extraction capability as part of this task. It’s very well documented, as is the rest of Textract. I was slightly disappointed by the examples, but nothing serious.
I wanted to write this short blog post to share a piece of code I use to convert tables extracted through Amazon Textract to pandas.

https://maxhalford.github.io/blog/textract-table-to-pandas/

Thanks for the great article! I am curious which templates work well with Textract and which don't. I am also interested in how you do it manually and the type of clustering algorithms you use! That would be a big help. Thanks :)

Hey there @heathervant!

I am curious which templates work well with textract and which don't?

In my experience, Textract doesn't work well when the table is surrounded by a frame of some sort. I imagine that they have some border detection method, and the fact that the table is contained in a frame that has borders is confusing to the algorithm.

I am also interested in how you do it manually and the type of clustering algorithms you use!

I directly process the raw annotations returned by Textract or Google Vision's OCR. I build a distance matrix that only uses the y (vertical) coordinate of each annotation. I also manually set the distance to +∞ if the annotations have similar x (horizontal) coordinates. I've had the best results by using scikit-learn's AgglomerativeClustering and setting the linkage parameter to 'complete'. This makes a lot of sense because of the way agglomerative clustering works with complete linkage: the distance between two clusters is the largest distance between any pair of their members. It will start by merging annotations that are on the same row because their y coordinates are close. Then, it will attempt to merge rows together, but will stop because the rows contain annotations that have similar x coordinates, which puts them infinitely far apart. I'll write a separate blog post on this if there's interest :)
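For readers who want to see the idea in code, here is a minimal sketch of the row-grouping step described above. It is not the author's implementation: the comment describes using scikit-learn's AgglomerativeClustering with complete linkage, whereas this sketch swaps in a small hand-rolled complete-linkage loop so it runs without dependencies. The annotation format (dicts with `text`, `x`, `y`), the function names, and the `x_tol` and `y_threshold` parameters are all assumptions made for illustration.

```python
import math


def build_distance_matrix(annotations, x_tol=5.0):
    """Pairwise distance = |y_i - y_j|, or +inf when x coordinates are similar."""
    n = len(annotations)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b = annotations[i], annotations[j]
            same_column = abs(a["x"] - b["x"]) <= x_tol  # x_tol is an assumed tolerance
            d = math.inf if same_column else abs(a["y"] - b["y"])
            dist[i][j] = dist[j][i] = d
    return dist


def complete_linkage_clusters(dist, threshold):
    """Naive agglomerative clustering: merge the two closest clusters until
    the closest pair is farther apart than `threshold`."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the LARGEST member distance.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # every remaining merge would join two different rows
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def group_into_rows(annotations, x_tol=5.0, y_threshold=10.0):
    """Return the annotation texts grouped by row, top to bottom, left to right."""
    dist = build_distance_matrix(annotations, x_tol)
    clusters = complete_linkage_clusters(dist, y_threshold)
    rows = [sorted((annotations[i] for i in c), key=lambda a: a["x"]) for c in clusters]
    rows.sort(key=lambda row: min(a["y"] for a in row))
    return [[a["text"] for a in row] for row in rows]


# Made-up annotations for a two-column, three-row table.
annotations = [
    {"text": "Item", "x": 10, "y": 100},
    {"text": "Price", "x": 200, "y": 101},
    {"text": "Apple", "x": 12, "y": 120},
    {"text": "1.20", "x": 201, "y": 119},
    {"text": "Pear", "x": 11, "y": 140},
    {"text": "0.90", "x": 199, "y": 141},
]

# A deliberately loose y threshold: the infinite distances between
# same-column annotations are what keep the rows apart here.
rows = group_into_rows(annotations, y_threshold=50.0)
print(rows)
```

With complete linkage, a single infinite pairwise distance is enough to make two clusters infinitely far apart, which is why two rows that share a column can never be merged, however loose the vertical threshold is.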

Hi @MaxHalford, thanks, that is super interesting! I would definitely be interested in a blog post on how you use agglomerative clustering to read data tables and classify pieces of documents. Are you able to extract headers, bullet points, and numbered lists with this method or any other method? Very interesting work, I appreciate you sharing.

No worries! I'll write a post sometime during the next couple of months. Once you have extracted lines, you can extract concepts, such as headers and lists, via some regex logic. At least, this has worked well for me on my use cases. Extracting lines is really the fundamental building block :). Hope that makes sense.
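As an illustration of what "some regex logic" on extracted lines could look like, here is a small sketch. The patterns and the `classify_line` helper are hypothetical and not taken from the author's code. In particular, treating an all-caps line as a header is only an assumed heuristic that would need tuning per document type.

```python
import re

# Hypothetical patterns, chosen for illustration only.
BULLET = re.compile(r"^[-*•]\s+(?P<body>.+)$")
NUMBERED = re.compile(r"^(?P<num>\d+)[.)]\s+(?P<body>.+)$")
HEADER = re.compile(r"^[A-Z][A-Z0-9 &/-]{2,}$")  # assumption: headers are all caps


def classify_line(line):
    """Label one extracted line as 'bullet', 'numbered', 'header', or 'text'."""
    text = line.strip()
    if BULLET.match(text):
        return "bullet"
    if NUMBERED.match(text):
        return "numbered"
    if HEADER.match(text):
        return "header"
    return "text"


lines = [
    "INVOICE DETAILS",
    "- Apples",
    "• Pears",
    "2) Ship by Friday",
    "Total due on receipt",
    "1.20",
]
labels = [classify_line(line) for line in lines]
print(list(zip(lines, labels)))
```

The numbered-list pattern requires whitespace after the `.` or `)`, so a bare amount such as `1.20` is not mistaken for a list item.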

This is very helpful

Thanks. Helped me customize my Textract outputs fast.