yobix-ai/extractous

Support for Extracting PDF Content as XML

Closed this issue · 7 comments

Hi, I’d like to use Extractous for my document processing tasks. I often need to extract PDF content as XML to retain structural information, such as page boundaries. This is a feature supported by Apache Tika, but it seems that currently, Extractous only provides plain text extraction.

Would it be possible to add support for XML extraction, similar to Tika’s functionality? This feature would be incredibly useful for preserving document structure.
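
To illustrate what XML output preserves: Tika's PDF parser emits XHTML in which each page is wrapped in a `<div class="page">` element. The sketch below assumes that layout and splits the text per page using only Python's standard library. The sample string is hand-written for illustration, not real extractor output, and `split_pages` is a hypothetical helper, not part of Tika or Extractous.

```python
import xml.etree.ElementTree as ET

# XHTML namespace in ElementTree's "{uri}tag" notation.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hand-written sample imitating Tika-style XHTML; not real extractor output.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<div class="page"><p>First page text.</p></div>
<div class="page"><p>Second page,</p><p>two paragraphs.</p></div>
</body>
</html>"""


def split_pages(xhtml: str) -> list[str]:
    """Return the text of each <div class="page"> element, in document order."""
    root = ET.fromstring(xhtml)
    pages = []
    for div in root.iter(f"{XHTML_NS}div"):
        if div.get("class") == "page":
            # itertext() yields every nested text node; collapse the whitespace.
            text = " ".join(" ".join(div.itertext()).split())
            pages.append(text)
    return pages


if __name__ == "__main__":
    for number, text in enumerate(split_pages(SAMPLE_XHTML), start=1):
        print(f"page {number}: {text}")
```

With plain text output the two pages would run together, which is why the page-level `<div>` wrappers matter for this use case.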

Thank you for considering this request!

Thanks for reporting this. We are working on it and will update this issue when we have a working implementation.

Thanks for mentioning this project to me over on Reddit. I'll definitely consider integrating it into txtai as another text extraction engine once this change is in.

I've long thought Tika is a good solution but the Java piece trips a lot of people up.

Thanks @davidmezzetti, we can definitely assist with the integration. Integrating with other frameworks such as txtai was always part of our plan. At the moment we are focusing on supporting the Tika features users most expect (including XML output); then we can move on to integrations.
We'll get in touch once this is in.

Sounds good. I should be able to add an integration fairly easily, ~20-30 lines with txtai once you have this change. I'll keep an eye on this!

Hi @nmammeri,
do you have any updates on the progress of XML extraction support?
Thanks again for your work on this!

Hi @davidmezzetti and @coroluca, I'm glad to announce that we finally got the XML output feature in. Please check version 0.3.0 🎉. Thanks for your patience.

Great work! I'll take a look.