Reading contents of a PDF

Question

santiagomed opened this issue a year ago · 9 comments

Is there an example of how to simply read the contents of a PDF? I tried looking into read.rs, but it seems to be outdated, so I can't run it. Is there any way to read a PDF?

Answer 1 · 2023-09-17T01:42:46.000Z

What content do you want?
There is a lot in there.

Content stream? You can get that from the page object.
Text? See the pdf_render and pdf_text crates.

You can use the pdf crate in two versions:

The release from crates.io; in that case, use the examples that match it: https://github.com/pdf-rs/pdf/tree/a6e2abc96b23b64aa1051966bb000aabf1275d9f
master, which has the latest fixes.

The pdf_render and pdf_text crates only work with the latest master.
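
For the master route, a Cargo.toml along these lines might work. This is only a sketch: the repository URLs for pdf_render and pdf_text are assumed to sit under the same pdf-rs organization as the pdf repository linked above, and nothing is pinned to a specific commit, so check the locations and consider adding a `rev` before relying on it.

```toml
[dependencies]
# pdf from git master instead of crates.io, so it matches what
# pdf_render and pdf_text are built against.
pdf = { git = "https://github.com/pdf-rs/pdf" }

# Not published on crates.io; the URLs below are assumptions.
pdf_render = { git = "https://github.com/pdf-rs/pdf_render" }
pdf_text = { git = "https://github.com/pdf-rs/pdf_text" }
```

Taking pdf from the same git source as the other two crates should keep Cargo from pulling in a second, incompatible copy of pdf from crates.io.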

Answer 2 · 2023-12-08T15:27:37.000Z

What are the pdf_render and pdf_text crates? crates.io doesn't know anything about them.

Answer 3 · 2023-12-08T16:21:07.000Z

They are not on crates.io because they do not meet my stability requirements for publishing there.
pdf_render … renders pdfs.
pdf_text extracts text.

Answer 4 · 2024-05-07T19:56:00.000Z

The pdf-extract crate exists, but it depends on lopdf, not pdf. This video benchmarks it against poppler, a C library.

I'd be curious to see the same C/Rust comparison, but with poppler against pdf_text.