Support for Extracting PDF Content as XML

Question

Closed this issue 11 hours ago · 7 comments

Hi, I’d like to use Extractous for my document processing tasks. I often need to extract PDF content as XML to retain structural information, such as page boundaries. This is a feature supported by Apache Tika, but it seems that currently, Extractous only provides plain text extraction.

Would it be possible to add support for XML extraction, similar to Tika’s functionality? This feature would be incredibly useful for preserving document structure.

Thank you for considering this request!

Answer 1 · 2024-11-25T17:13:58.000Z

Thanks for reporting this. We are working on it and will update this issue when we have a working implementation.

Answer 2 · 2024-12-04T13:12:04.000Z

Thanks for mentioning this project to me over on Reddit. I'll definitely consider integrating it into txtai as another text extraction engine once this change is in.

I've long thought Tika is a good solution but the Java piece trips a lot of people up.

Answer 3 · 2024-12-04T13:42:20.000Z

Thanks @davidmezzetti, we can definitely assist with the integration. Integrations with other frameworks such as txtai have always been part of our plan. At the moment we are focusing on supporting the most commonly expected Tika features (including XML output). Then we can move on to integrations.
We'll get in touch once this is in.

Answer 4 · 2024-12-04T20:22:24.000Z

Sounds good. I should be able to add an integration fairly easily, ~20-30 lines with txtai once you have this change. I'll keep an eye on this!

Answer 5 · 2024-12-18T09:05:15.000Z

Hi @nmammeri,
do you have any updates on the progress of XML extraction support?
Thanks again for your work on this!

Answer 6 · 2024-12-21T10:18:48.000Z

Hi @davidmezzetti and @coroluca, I'm glad to announce that we finally got the XML output feature in. Please check version 0.3.0 🎉. Thanks for your patience.
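
For anyone who wants to recover page boundaries from XML output, here is a rough sketch of the post-processing step. It assumes the output follows Tika's XHTML convention of wrapping each PDF page in a `<div class="page">` element. The `SAMPLE` string and the `split_pages` helper are hypothetical and exist only for illustration, so check them against the XML you actually get before relying on this.

```python
import xml.etree.ElementTree as ET

# XHTML namespace used by Tika-style output
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample, modeled on Tika's XHTML for PDFs (one div per page).
# Real output will carry more metadata and markup.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<div class="page"><p>First page text.</p></div>
<div class="page"><p>Second page,</p><p>two paragraphs.</p></div>
</body>
</html>"""


def split_pages(xml_text):
    """Return the text of each <div class="page">, in document order."""
    root = ET.fromstring(xml_text)
    pages = []
    for div in root.iter(XHTML_NS + "div"):
        if div.get("class") == "page":
            # Collect every text node under the page, dropping blank ones
            text = " ".join(t.strip() for t in div.itertext() if t.strip())
            pages.append(text)
    return pages


for number, text in enumerate(split_pages(SAMPLE), start=1):
    print(number, text)
```

The standard-library parser is enough here because the output is well-formed XML. If the real output turns out not to be, a lenient HTML parser would be the fallback.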

Answer 7 · 2024-12-21T10:43:47.000Z

Great work! I'll take a look.