pdf-rs/pdf

Reading contents of a PDF

santiagomed opened this issue · 9 comments

Is there an example of how to simply read the contents of a PDF successfully? I tried looking into read.rs, but it seems to be outdated, so I can't run it. Is there any way to read a PDF?

s3bk commented

What content do you want?
There is a lot in there.

  • Content stream? You can get that from the page object.
  • Text? See the pdf_render and pdf_text crates.
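To illustrate why "content stream" and "text" are different questions, here is a minimal sketch in plain Rust with no dependencies. It does not use the pdf crate's API. It assumes you already have a decoded content stream as a string, and the helper name `tj_strings` is made up for this example. It only collects literal strings passed to the `Tj` (show text) operator and ignores `TJ` arrays, hex strings, and font encodings, which is roughly the work a real text extractor has to do.

```rust
/// Collect the literal strings shown by `Tj` operators in an already
/// decoded PDF content stream. Hypothetical helper for illustration only:
/// it ignores `TJ` arrays, hex strings, and font encodings, and it keeps
/// the character after a backslash as-is instead of decoding escapes
/// such as `\n` or octal codes.
fn tj_strings(content: &str) -> Vec<String> {
    let mut out = Vec::new();
    // Most recent string operand, waiting for its operator.
    let mut last_string: Option<String> = None;
    // Current non-string token (operator, number, or name).
    let mut word = String::new();
    // A trailing space flushes the final token.
    let mut chars = content.chars().chain(std::iter::once(' '));

    while let Some(c) = chars.next() {
        if c == '(' {
            // Literal string: balanced parentheses may nest.
            let mut depth = 1;
            let mut s = String::new();
            while let Some(c) = chars.next() {
                match c {
                    '\\' => {
                        if let Some(escaped) = chars.next() {
                            s.push(escaped);
                        }
                    }
                    '(' => {
                        depth += 1;
                        s.push(c);
                    }
                    ')' => {
                        depth -= 1;
                        if depth == 0 {
                            break;
                        }
                        s.push(c);
                    }
                    _ => s.push(c),
                }
            }
            last_string = Some(s);
            word.clear();
        } else if c.is_whitespace() {
            if !word.is_empty() {
                if word == "Tj" {
                    // `Tj` consumes the string operand right before it.
                    if let Some(s) = last_string.take() {
                        out.push(s);
                    }
                } else {
                    // Any other token means the pending string, if any,
                    // was not an operand of `Tj`.
                    last_string = None;
                }
                word.clear();
            }
        } else {
            word.push(c);
        }
    }
    out
}
```

For example, `tj_strings("BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (world) Tj ET")` yields the two strings `Hello` and `world`. In a real file the bytes inside the parentheses are character codes in the current font's encoding, not necessarily readable text, which is why a dedicated text-extraction crate is needed.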

You can use the pdf crate in two versions:

The pdf_render and pdf_text crates only work with the latest master.

vjau commented

What are the pdf_render and pdf_text crates? crates.io doesn't know anything about them.

s3bk commented

They are not on crates.io because they do not meet my stability requirements for publishing there.
pdf_render … renders pdfs.
pdf_text extracts text.

The pdf-extract crate exists, but it depends on lopdf, not pdf. This video benchmarks it against poppler, a C++ library.

I'd be curious to see a similar comparison, but with poppler against pdf_text.